In this project, you will apply unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.
This notebook will help you complete this task by providing a framework within which you will perform your analysis steps. In each step of the project, you will see some text describing the subtask that you will perform, followed by one or more code cells for you to complete your work. Feel free to add additional code and markdown cells as you go along so that you can explore everything in precise chunks. The code cells provided in the base template will outline only the major tasks, and will usually not be enough to cover all of the minor tasks that comprise it.
It should be noted that while there will be precise guidelines on how you should handle certain tasks in the project, there will also be places where an exact specification is not provided. There will be times in the project where you will need to make and justify your own decisions on how to treat the data. These are places where there may not be only one way to handle the data. In real-life tasks, there may be many valid ways to approach an analysis task. One of the most important things you can do is clearly document your approach so that other scientists can understand the decisions you've made.
At the end of most sections, there will be a Markdown cell labeled Discussion. In these cells, you will report your findings for the completed section, as well as document the decisions that you made in your approach to each subtask. Your project will be evaluated not just on the code used to complete the tasks outlined, but also your communication about your observations and conclusions at each stage.
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick # to format x axis labels with thousand comma separators
# magic word for producing visualizations in notebook
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from workspace_utils import active_session
from sklearn.metrics import silhouette_score
from sklearn.impute import SimpleImputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22
import time
There are four files associated with this project (not including this one):

- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891221 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information file about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns.

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. You will use this information to cluster the general population into groups with similar demographic properties. Then, you will see how the people in the customers dataset fit into those created clusters. The hope here is that certain clusters are over-represented in the customers data, as compared to the general population; those over-represented clusters will be assumed to be part of the core userbase. This information can then be used for further applications, such as targeting for a marketing campaign.
To start off with, load in the demographics data for the general population into a pandas DataFrame, and do the same for the feature attributes summary. Note for all of the .csv data files in this project: they're semicolon (;) delimited, so you'll need an additional argument in your read_csv() call to read in the data properly. Also, considering the size of the main dataset, it may take some time for it to load completely.
Once the dataset is loaded, it's recommended that you take a little bit of time just browsing the general structure of the dataset and feature summary file. You'll be getting deep into the innards of the cleaning in the first major step of the project, so gaining some general familiarity can help you get your bearings.
# Load in the general demographics data.
azdias = pd.read_csv("Udacity_AZDIAS_Subset.csv", sep=';')
# Load in the feature summary file.
feat_info = pd.read_csv("AZDIAS_Feature_Summary.csv", sep=';')
# my copy to play around with.
# Resaving it because the original is hard to open in Excel due to sep=';'
# feat_info.to_excel("AZDIAS_Feature_Summary_RB_Copy.xlsx")
# this takes forever to run due to the size of the dataset
# azdias.to_excel("azdias_RB_Copy.xlsx")
Configure how much of a dataframe the notebook shows:

- https://discuss.analyticsvidhya.com/t/how-to-display-full-dataframe-in-pandas/23298
- https://www.ritchieng.com/pandas-changing-display-options/
- https://stackoverflow.com/questions/11707586/how-do-i-expand-the-output-display-to-see-more-columns
- https://stackoverflow.com/questions/49216197/jupyter-notebook-has-become-very-slow-suddenly

Be careful: very high limits may slow down the notebook. For example (from Stack Overflow), pd.set_option('display.max_columns', 50000) was causing serious time issues; changing it to pd.set_option('display.max_columns', 50) solved the problem. In my personal experience, pd.set_option('display.width', 1000) was also causing slowness, and lowering it to pd.set_option('display.width', 50) improved it.
# set pandas options so I can view more of a dataframe
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 200)
azdias.head(10)
azdias.tail(10)
# sample dataframe
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
azdias.sample(n=20, random_state=42)
def explore_data(df):
    """Explore data. Print basic information and descriptive stats."""
    print("Shape: (rows, cols)")
    print(df.shape)
    print()
    print("Dataframe information:")
    print(df.info())
    print()
    print("Descriptive stats:")
    print(df.describe())
explore_data(azdias)
azdias.describe().transpose()
def value_counts_in_each_column(df):
    """Print value counts for each column
    to see the mode and how many categories there are.
    """
    for column in df.columns.tolist():
        print(column)
        print(df[column].value_counts())
        print()
value_counts_in_each_column(azdias)
def bar_graph_for_each_column(df, column_list='all'):
    """Loop through columns in the dataset and plot the frequencies of values in each.
    df - your data
    column_list - which columns you want to look at (pass a list). Default = 'all'
    Y axis - unique values in the column
    X axis - frequency of that value
    https://etav.github.io/python/count_basic_freq_plot.html
    https://stackoverflow.com/questions/38152356/matplotlib-dollar-sign-with-thousands-comma-tick-labels/38152510
    https://stackoverflow.com/questions/50169311/getting-percentage-and-count-python
    """
    if column_list == 'all':
        column_list = df.columns.tolist()
    for column in column_list:
        fig, ax = plt.subplots(1, 1, figsize=(10, 5))
        df[column].value_counts().plot(kind='barh', ax=ax)
        ax.invert_yaxis()
        plt.title(column)
        ax.set_ylabel('Value')
        ax.set_xlabel('Frequency')
        fmt = '{x:,.0f}'  # thousand-comma separators on the x axis
        tick = mtick.StrMethodFormatter(fmt)
        ax.xaxis.set_major_formatter(tick)
        plt.show()
        # print(df[column].value_counts())
        summary = pd.concat(
            [df[column].value_counts(),
             df[column].value_counts(normalize=True).mul(100)],
            axis=1, keys=('counts', 'percentage'))
        print(column)
        print(summary)
bar_graph_for_each_column(azdias,column_list=['AGER_TYP', 'RELAT_AB'])
# bar_graph_for_each_column?
bar_graph_for_each_column(azdias)
explore_data(feat_info)
feat_info
feat_info.head(10)
# bar_graph_for_each_column?
bar_graph_for_each_column(feat_info, column_list=['information_level','type'])
Most features describe an individual person, and most are ordinal; the second-largest group of feature types, after ordinal, is categorical.
Tip: Add additional cells to keep everything in reasonably-sized chunks! Keyboard shortcuts:
esc --> a (press Escape to enter command mode, then press the 'A' key) adds a new cell before the active cell, and esc --> b adds a new cell after the active cell. If you need to convert an active cell to a markdown cell, use esc --> m, and to convert to a code cell, use esc --> y.
The feature summary file contains a summary of properties for each demographics data column. You will use this file to help you make cleaning decisions during this stage of the project. First of all, you should assess the demographics data in terms of missing data. Pay attention to the following points as you perform your analysis, and take notes on what you observe. Make sure that you fill in the Discussion cell with your findings and decisions at the end of each step that has one!
The fourth column of the feature attributes summary (loaded in above as feat_info) documents the codes from the data dictionary that indicate missing or unknown data. While the file encodes this as a list (e.g. [-1,0]), this will get read in as a string object. You'll need to do a little bit of parsing to make use of it to identify and clean the data. Convert data that matches a 'missing' or 'unknown' value code into a numpy NaN value. You might want to see how much data takes on a 'missing' or 'unknown' code, and how much data is naturally missing, as a point of interest.
As one more reminder, you are encouraged to add additional cells to break up your analysis into manageable chunks.
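Before cleaning, it helps to see how the string-encoded code lists can be parsed. A minimal sketch (the helper name `parse_missing_codes` is my own, not from the project template): a strip/split handles strings like '[-1,0]', and since some codes in the data dictionary are non-numeric (e.g. 'X'), numeric conversion needs a guard.

```python
def parse_missing_codes(code_string):
    """Parse a string like '[-1,0]' or '[-1,X]' into a list of codes.

    Numeric codes are converted to int so they can match numeric columns;
    non-numeric codes (e.g. 'X', 'XX') are kept as strings.
    """
    stripped = code_string.strip('[]')
    if stripped == '':
        return []
    codes = []
    for token in stripped.split(','):
        token = token.strip()
        try:
            codes.append(int(token))
        except ValueError:
            codes.append(token)
    return codes

print(parse_missing_codes('[-1,0]'))   # [-1, 0]
print(parse_missing_codes('[-1,X]'))   # [-1, 'X']
print(parse_missing_codes('[]'))       # []
```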
# how many missing (NaN) values there are in each column
# this is BEFORE we convert codes which are not missing values themselves, but denote missing or unknown information
# This is how much data is NATURALLY missing (Python NaN, SQL NULL, Excel blank)
azdias.isnull().sum()
# this will give you the same result:
# azdias.isna().sum()
# If you add it to the non-null counts from azdias.info(), you get 891221 for each column.
# how many missing (NaN) values there are in each column
# this is BEFORE we convert codes which are not missing values themselves, but denote missing or unknown information
# This is how much data is NATURALLY missing (Python NaN, SQL NULL, Excel blank)
def how_many_NA(df):
    """Loop through columns. Report the number of missing NA values.
    This function tells you how many missing values there are in each column.
    If you add it to the non-null counts from azdias.info(), you get 891221 for each column.
    This is an extended version of the check above: it also calculates the % missing per column.
    Returns a dataframe/report on NaN values.
    """
    missing_NA_list = []
    missing_NA_percent_list = []
    for column in df.columns.tolist():
        missing_NA = df[column].isna().sum()
        missing_NA_percent = missing_NA / len(df)
        missing_NA_list.append(missing_NA)
        missing_NA_percent_list.append(missing_NA_percent)
    missing_value_report_df = pd.DataFrame(
        {'Column': df.columns.tolist(),
         'missing_NA': missing_NA_list,
         'missing_NA_percent': missing_NA_percent_list
         }
    )
    return missing_value_report_df
azdias_NA_report = how_many_NA(azdias)
azdias_NA_report
# how_many_NA?
# save to Excel
# azdias_NA_report.to_excel("azdias_NA_report.xlsx")
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_NA_report['Column'], azdias_NA_report['missing_NA'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('NATURALLY missing data (NAN) in the ORIGINAL dataset', fontsize=18)
fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick)
ax.invert_yaxis()
plt.show()
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_NA_report['Column'], azdias_NA_report['missing_NA_percent'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN_percent', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('NATURALLY missing data (NAN) in the ORIGINAL dataset - PERCENT', fontsize=18)
ax.invert_yaxis()
plt.show()
"""
this is like VLOOKUP in Excel. Look up value from [missing_or_unknown] column of
feat_info by value in [attribute] column of feat_info
https://stackoverflow.com/questions/18327624/find-elements-index-in-pandas-series
these values match feat_info.missing_or_unknown values
"""
for column in azdias.columns.tolist():
print(feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]])
How much missing data is present in each column? There are a few columns that are outliers in terms of the proportion of values that are missing. You will want to use matplotlib's hist() function to visualize the distribution of missing value counts to find these columns. Identify and document these columns. While some of these columns might have justifications for keeping or re-encoding the data, for this project you should just remove them from the dataframe. (Feel free to make remarks about these outlier columns in the discussion, however!)
For the remaining features, are there any patterns in which columns have, or share, missing data?
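Following the project hint above, a histogram of per-column missing fractions is a quick way to spot the outlier columns. A sketch (assuming a report dataframe with a 'missing_NA_percent' column, like the one produced by how_many_NA() in this notebook; the toy data below is illustrative only):

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_missing_histogram(na_report_df, bins=10):
    """Histogram of per-column missing-value fractions.

    Expects a dataframe with a 'missing_NA_percent' column.
    Returns the bin counts so the result can be inspected.
    """
    fig, ax = plt.subplots(figsize=(10, 5))
    counts, _, _ = ax.hist(na_report_df['missing_NA_percent'], bins=bins, alpha=0.7)
    ax.set_xlabel('Fraction of values missing')
    ax.set_ylabel('Number of columns')
    ax.set_title('Distribution of missing data across columns')
    plt.show()
    return counts

# toy report: most columns nearly complete, two outliers
toy_report = pd.DataFrame({
    'Column': ['a', 'b', 'c', 'd', 'e'],
    'missing_NA_percent': [0.0, 0.01, 0.02, 0.65, 0.77],
})
counts = plot_missing_histogram(toy_report)
```

Columns that land in the far-right bins are the candidates for removal.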
def how_many_coded_uknown_or_missing(df):
    """Loop through the columns of df.
    Look at which specific values are codes for missing and unknown data.
    How many values are coded as missing or unknown? Print a report.
    """
    # initialize empty lists
    coded_unknown_or_missing_list = []
    coded_unknown_or_missing_percent_list = []
    i = 0
    # loop through each column. Which values indicate missing and unknown?
    # how many values are coded as missing or unknown? (They will need to be replaced with NaN)
    for column in df.columns.tolist():
        # this is just the number of the column
        print("Column number:", i)
        i = i + 1
        list_NAN_for_this_column = feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]]
        print(column)
        print("'missing' or 'unknown' code BEFORE parsing/cleaning the string")
        print(list_NAN_for_this_column)
        print(type(list_NAN_for_this_column))
        list_NAN_for_this_column = list_NAN_for_this_column.strip('[').strip(']').split(",")
        # numeric codes must be converted from strings to ints,
        # otherwise isin() will not match them against numeric columns
        list_NAN_for_this_column = [int(x) if x.lstrip('-').isdigit() else x
                                    for x in list_NAN_for_this_column]
        print("'missing' or 'unknown' code AFTER parsing/cleaning the string")
        print(list_NAN_for_this_column)
        print(type(list_NAN_for_this_column))
        print("Sum of values in the original dataset which are coded as unknown or missing.")
        coded_unknown_or_missing = df[column].isin(list_NAN_for_this_column).sum()
        print(coded_unknown_or_missing)
        print("#############################################################################")
        print()
        # append the count of values coded as missing or unknown to the list
        coded_unknown_or_missing_list.append(coded_unknown_or_missing)
        coded_unknown_or_missing_percent = coded_unknown_or_missing / len(df)
        coded_unknown_or_missing_percent_list.append(coded_unknown_or_missing_percent)
    # create df
    coded_missing_unknown_report_df = pd.DataFrame(
        {
            'Column': df.columns.tolist(),
            'Coded_Uknown_or_Missing': coded_unknown_or_missing_list,
            'Coded_Uknown_or_Missing_Percent': coded_unknown_or_missing_percent_list
        }
    )
    return coded_missing_unknown_report_df
azdias_coded_unknown_or_missing_report = how_many_coded_uknown_or_missing(azdias)
# azdias_coded_unknown_or_missing_report.to_excel("azdias_coded_unknown_or_missing_report.xlsx")
azdias_coded_unknown_or_missing_report
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_coded_unknown_or_missing_report['Column'], azdias_coded_unknown_or_missing_report['Coded_Uknown_or_Missing'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('Coded_Uknown_or_Missing', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Data CODED AS UNKNOWN OR MISSING in the ORIGINAL dataset', fontsize=18)
fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick)
ax.invert_yaxis()
plt.show()
def replace_coded_as_missing_unknown_with_NANs(df):
    """
    Copy the dataframe.
    In the copy, replace missing and unknown codes with NaN.
    int64 columns get upcast to float64 because NaN is a float, so integer columns cannot hold it.
    https://stackoverflow.com/questions/53819909/pandas-replace-values-in-dataframe-conditionally-based-on-string-compare
    https://stackoverflow.com/questions/41870093/pandas-shift-converts-my-column-from-integer-to-float
    """
    # create a copy
    df_cleaned = df.copy()
    for column in df_cleaned.columns.tolist():
        # which values in this column indicate missing or unknown?
        list_NAN_for_this_column = feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]]
        # clean up the missing and unknown codes
        list_NAN_for_this_column = list_NAN_for_this_column.strip('[').strip(']').split(",")
        # numeric codes must be ints so isin() matches them against numeric columns
        list_NAN_for_this_column = [int(x) if x.lstrip('-').isdigit() else x
                                    for x in list_NAN_for_this_column]
        # replace with NaN. Two methods (Stack Overflow links in the docstring). Both work well and have been tested.
        # df_cleaned[column] = np.where(df_cleaned[column].isin(list_NAN_for_this_column), np.nan, df_cleaned[column])
        df_cleaned.loc[df_cleaned[column].isin(list_NAN_for_this_column), column] = np.nan
    return df_cleaned
azdias_cleaned = replace_coded_as_missing_unknown_with_NANs(azdias)
def check_column(df, df_cleaned, column):
    """check that the right values were replaced with NaN"""
    print("delete these values:", feat_info.missing_or_unknown[feat_info[feat_info.attribute == column].index[0]])
    print()
    print(df[column].value_counts())
    print()
    print(df_cleaned[column].value_counts())
check_column(azdias, azdias_cleaned, 'AGER_TYP')
for column in ['ALTERSKATEGORIE_GROB', 'ORTSGR_KLS9', 'RELAT_AB', 'ANREDE_KZ']:
    check_column(azdias, azdias_cleaned, column)
    print('################################')
    print()
# Perform an assessment of how much missing data there is in each column of the
# dataset.
azdias_cleaned_NA_report = how_many_NA(azdias_cleaned)
azdias_cleaned_NA_report
# azdias_cleaned_NA_report.to_excel("azdias_cleaned_NA_report.xlsx")
# Investigate patterns in the amount of missing data in each column.
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_cleaned_NA_report['Column'], azdias_cleaned_NA_report['missing_NA'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('MISSING/NAN data in the dataset AFTER replacing missing/unknown codes with NAN', fontsize=18)
fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick)
ax.invert_yaxis()
plt.show()
# Investigate patterns in the amount of missing data in each column.
"""
red bars indicate columns with more than 33% missing data
https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
https://stackoverflow.com/questions/22341271/get-list-from-pandas-dataframe-column
https://stackoverflow.com/questions/3832809/how-to-change-the-color-of-a-single-bar-if-condition-is-true-matplotlib
"""
##############################################################################
# in the visualization below, outliers will be marked in red.
# An outlier is a column with more than 33% missing data
condition = azdias_cleaned_NA_report['missing_NA_percent'] > 0.33
azdias_cleaned_NA_report['colour'] = np.where(condition, 'red', 'blue')
##############################################################################
fig, ax = plt.subplots(1, 1, figsize=(15, 30))
clrs = list(azdias_cleaned_NA_report['colour'])
plt.barh(azdias_cleaned_NA_report['Column'], azdias_cleaned_NA_report['missing_NA_percent'], align='center', alpha=0.5, color = clrs)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN_percent', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('MISSING/NAN data in the dataset after replacing missing/unknown codes with NAN - PERCENT', fontsize=18)
ax.invert_yaxis()
plt.show()
# Investigate patterns in the amount of missing data in each column.
# the following columns have more than 1/3 of data missing
outliers_df = azdias_cleaned_NA_report[azdias_cleaned_NA_report['missing_NA_percent']>0.333]
outliers_df
# Remove the outlier columns from the dataset. (You'll perform other data
# engineering tasks such as re-encoding and imputation later.)
print("BEFORE removing outlier columns, shape: ", azdias_cleaned.shape)
for outlier_column in outliers_df['Column'].tolist():
    del(azdias_cleaned[outlier_column])
print("AFTER removing outlier columns, shape: ", azdias_cleaned.shape)
Removed 6 columns with too much missing data
azdias_cleaned_NA_report = how_many_NA(azdias_cleaned)
azdias_cleaned_NA_report
"""https://pythonspot.com/matplotlib-bar-chart/
https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib
https://stackoverflow.com/questions/28022227/sorted-bar-charts-with-pandas-matplotlib-or-seaborn
"""
fig, ax = plt.subplots(1, 1,figsize=(15, 30))
plt.barh(azdias_cleaned_NA_report['Column'], azdias_cleaned_NA_report['missing_NA_percent'], align='center', alpha=0.5)
plt.ylabel('Column', fontsize=18)
plt.xlabel('NAN_percent', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('MISSING/NAN data in the dataset - deleted NAN outlier columns', fontsize=18)
ax.invert_yaxis()
plt.show()
The following columns have been removed; they all have more than 33% missing data.
Three of them are related to age: AGER_TYP, GEBURTSJAHR and ALTER_HH.
KK_KUNDENTYP has the largest amount of naturally missing (NA) data (66%).
AGER_TYP and TITEL_KZ have the largest amount of data coded as missing or unknown.
An odd thing: once you remove the outlier columns and sort all variables alphabetically, the variables with missing values are grouped together in 3 groups. You can see this above in the chart "MISSING/NAN data in the dataset - deleted NAN outlier columns". This could be just happenstance. It's as if we have 3 groups or "clusters" (a metaphor only; I know these are not clusters in the statistical sense of the word).
Now, you'll perform a similar assessment for the rows of the dataset. How much data is missing in each row? As with the columns, you should see some groups of points that have very different numbers of missing values. Divide the data into two subsets: one for data points that are above some threshold for missing values, and a second subset for points below that threshold.
In order to know what to do with the outlier rows, we should see if the distribution of data values on columns that are not missing data (or are missing very little data) are similar or different between the two groups. Select at least five of these columns and compare the distribution of values.
Use seaborn's countplot() function to create a bar chart of code frequencies, and matplotlib's subplot() function to put the bar charts for the two subsets side by side. Depending on what you observe in your comparison, this will have implications for how you approach your conclusions later in the analysis. If the distributions of non-missing features look similar between the data with many missing values and the data with few or no missing values, then we could argue that simply dropping those points from the analysis won't present a major issue. On the other hand, if the data with many missing values looks very different from the data with few or no missing values, then we should make a note of those data as special. We'll revisit these data later on. Either way, you should continue your analysis for now using just the subset of the data with few or no missing values.
def count_NAN_in_each_ROW(df):
    """
    Count the number of missing values in each row.
    Adds a column NAN_count.
    https://datascience.stackexchange.com/questions/12645/how-to-count-the-number-of-missing-values-in-each-row-in-pandas-dataframe
    """
    # number of columns MINUS something like Excel's COUNTA function (counts both text and numbers)
    df['NAN_count'] = df.shape[1] - df.apply(lambda x: x.count(), axis=1)
# test the function above
test = azdias_cleaned.sample(n=100, random_state=42)
count_NAN_in_each_ROW(test)
# test.to_excel("test.xlsx")
# test
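As an aside: the apply-based count above is slow on ~900k rows. A vectorized equivalent (an alternative sketch, not the version used in the rest of this notebook) gives the same numbers in a single pass:

```python
import numpy as np
import pandas as pd

def count_nan_per_row_vectorized(df):
    """Return a Series with the number of NaN values in each row.

    Equivalent to df.shape[1] - df.apply(lambda x: x.count(), axis=1),
    but computed without a Python-level loop over rows.
    """
    return df.isnull().sum(axis=1)

# quick sanity check on a toy frame
toy = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})
print(count_nan_per_row_vectorized(toy).tolist())  # [1, 2, 0]
```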
# TESTING THE FUNCTION ABOVE - DONT USE FOR ANALYSIS
# sns.set(style="darkgrid")
# fig, ax = plt.subplots(1, 1,figsize=(20, 10))
# sns.countplot(x="NAN_count", data=test, color = 'blue', alpha=0.5)
# plt.ylabel('Frequency', fontsize=18)
# plt.xlabel('NAN Values in Row', fontsize=18)
# plt.xticks(fontsize=15)
# plt.yticks(fontsize=15)
# plt.title('Missing Data in Rows TEST DONT USE FOR ANALYSIS', fontsize=18)
# plt.show()
"""https://stackoverflow.com/questions/20461165/how-to-convert-index-of-a-pandas-dataframe-into-a-column
https://pandas.pydata.org/pandas-docs/version/0.18/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values"""
def row_wise_NA_count_summary_report(df):
row_wise_NA_count = df['NAN_count'].value_counts().to_frame()
row_wise_NA_count= row_wise_NA_count.reset_index()
row_wise_NA_count.columns = ['NAN Count in Row', 'Frequency']
row_wise_NA_count=row_wise_NA_count.sort_values(by ='NAN Count in Row')
return row_wise_NA_count
# """TEST - DONT USE FOR ANALYSIS"""
# row_wise_NA_count_summary_report_test = row_wise_NA_count_summary_report(test)
# row_wise_NA_count_summary_report_test
# this takes a while to run, but not too bad, just like 3-5 mins
count_NAN_in_each_ROW(azdias_cleaned)
# I added a column at the very end. NAN_count
azdias_cleaned.head(50)
# checking my NAN_count in each row
azdias_cleaned.isnull().sum(axis=1)
row_wise_NA_count_summary_report_azdias_cleaned = row_wise_NA_count_summary_report(azdias_cleaned)
row_wise_NA_count_summary_report_azdias_cleaned
# this is an aggregation: how many missing values per row vs frequency
# (from the report above, 623209 rows have NAN_count == 0)
len(azdias_cleaned) - 623209
print(sum(azdias_cleaned.isnull().any(axis=1)))
# 268012 rows have missing values
# 30% of rows have missing values - this calculation is manual and hardcoded
(len(azdias_cleaned) - 623209) / len(azdias_cleaned)
print(azdias_cleaned['NAN_count'].describe())
sns.set(style="darkgrid")
fig, ax = plt.subplots(1, 1,figsize=(20, 10))
sns.countplot(x="NAN_count", data=azdias_cleaned, color = 'blue', alpha=0.5)
plt.ylabel('Frequency', fontsize=18)
plt.xlabel('NAN Values in Row', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Missing Data in Rows', fontsize=18)
plt.show()
fig, ax = plt.subplots(1, 1,figsize=(20, 10))
sns.countplot(x="NAN_count", data=azdias_cleaned[azdias_cleaned['NAN_count']>0], color = 'blue', alpha=0.5)
plt.ylabel('Frequency', fontsize=18)
plt.xlabel('NAN Values in Row', fontsize=18)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.title('Missing Data in Rows - Only Rows with Missing Data', fontsize=18)
plt.show()
(This is just my screenshot of the chart above, which I can annotate by hand.)
Notice how there are 3 "clusters" of rows depending on how many missing values there are in each row. We can remove rows with missing data, depending on our threshold. Possible thresholds are:
around 29-30
around 40-42
around 9-10 is also possible, to tolerate very few missing values
Below I explore what happens with different thresholds: how much data would be dropped.
azdias_cleaned.head()
azdias_cleaned['NAN_count'].describe()
# Write code to divide the data into two subsets based on the number of missing
# values in each row.
def divide_into_subsets(df_cleaned, threshold):
    """Divide the data into 2 subsets depending on how many missing values there are in each row.
    Provide a threshold.
    """
    condition = df_cleaned['NAN_count'] > threshold
    df_cleaned['many_missing_values_in_row'] = np.where(condition, 1, 0)
    print("Threshold of missing values in the row: ", threshold)
    print()
    print('Total nrow:', len(df_cleaned['many_missing_values_in_row']))
    print()
    df_cleaned_missing_few = df_cleaned[df_cleaned['many_missing_values_in_row'] == 0]
    df_cleaned_missing_many = df_cleaned[df_cleaned['many_missing_values_in_row'] == 1]
    print("Few missing values. Will be kept.")
    print(len(df_cleaned_missing_few['NAN_count']))
    print()
    print("Lots of missing values. Will be deleted.")
    print(len(df_cleaned_missing_many['NAN_count']))
    print()
    print("% of rows with lots of missing values; % of data deleted:")
    print(df_cleaned['many_missing_values_in_row'].mean())
    return df_cleaned_missing_few, df_cleaned_missing_many, threshold
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold = divide_into_subsets(azdias_cleaned, 30)
# threshold: 9
# keep: 774743
# drop: 116478 (0.130694855709)
# threshold: 29
# keep: 798061
# drop: 93160 (0.104530750509694)
# threshold: 30
# keep: 798067
# drop: 93154 (0.104524018173)
# threshold: 42
# keep: 817622
# drop: 73599 (0.08258221024863642)
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold = divide_into_subsets(azdias_cleaned, 30)
azdias_cleaned.head()
azdias_cleaned_missing_few.head()
# Investigate patterns in the amount of missing data in each column.
# Very few missing columns
no_or_very_few_NAN = azdias_cleaned_NA_report[azdias_cleaned_NA_report['missing_NA_percent']<0.025]
no_or_very_few_NAN
no_or_very_few_NAN['missing_NA_percent'].max()
# THIS RUNS on my machine but not in Udacity workspace
# AttributeError: module 'seaborn' has no attribute 'catplot'
# """https://seaborn.pydata.org/generated/
# seaborn.countplot.html"""
# i = 0
# for variable in no_or_very_few_NAN['Column'].tolist():
# if i > 5:
# break
# sns.catplot(x=variable, col="many_missing_values_in_row",
# data=azdias_cleaned, kind="count",
# height=4, aspect=.7)
# plt.show()
# i = i+1
# Compare the distribution of values for at least five columns where there are
# no or few missing values, between the two subsets.
def pairwise_comparison(stop_after=5):
    """Create countplots of columns with few or no missing values
    for two datasets: rows with few and rows with many missing values.
    Stops after several variables.
    The plot titles indicate the threshold you picked above (the threshold separates the two subsets).
    """
    sns.set(style="white", palette="muted", color_codes=True)
    i = 0
    for variable in no_or_very_few_NAN['Column'].tolist():
        # testing and running this check on only a few variables
        if i > stop_after:
            break
        fig, ax = plt.subplots(1, 2, figsize=(15, 7))
        plt.subplot(1, 2, 1)
        sns.countplot(x=variable, data=azdias_cleaned_missing_few, color='blue', alpha=0.5)
        plt.ylabel('Frequency', fontsize=14)
        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        title = 'FEW missing values in rows, threshold: ' + str(threshold)
        plt.title(title, fontsize=18)
        plt.subplot(1, 2, 2)
        sns.countplot(x=variable, data=azdias_cleaned_missing_many, color='blue', alpha=0.5)
        plt.ylabel('Frequency', fontsize=14)
        plt.xticks(fontsize=15)
        plt.yticks(fontsize=15)
        title = 'MANY missing values in rows, threshold: ' + str(threshold)
        plt.title(title, fontsize=18)
        plt.show()
        i = i + 1
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold = divide_into_subsets(azdias_cleaned, 9)
pairwise_comparison(10)
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold = divide_into_subsets(azdias_cleaned, 30)
pairwise_comparison(10)
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold = divide_into_subsets(azdias_cleaned, 42)
pairwise_comparison(10)
# I'm re-running this code to make sure I'm actually using the threshold I decided on
# (I experimented with several different thresholds above)
azdias_cleaned_missing_few, azdias_cleaned_missing_many, threshold = divide_into_subsets(azdias_cleaned, 30)
azdias_cleaned_missing_few.shape
(Double-click this cell and replace this text with your own text, reporting your observations regarding missing data in rows. Are the data with lots of missing values are qualitatively different from data with few or no missing values?)
Data with lots of missing values in rows appear to be qualitatively different from data with few or no missing values. You can see this if you compare the above distributions side by side.
There are 3 clusters of rows, depending on how many missing values each row contains.
We can therefore split the dataset into two subsets based on how much missing data there is in each row. Two thresholds are plausible: 30 and 42.
30 would split cluster 1 from the rest of the data (~10% of rows removed).
42 would split cluster 3 from the rest of the data (~8% of rows removed; with this option, 817,622 of 891,221 rows would remain).
I ended up going with a threshold of 30, to minimize the amount of missing data in the retained subset while still following the logic of my analysis (the 3 clusters of rows based on missing data).
Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.
In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.
Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!
feat_info.head()
# How many features are there of each data type?
feat_info['type'].value_counts()
For categorical data, you would ordinarily need to encode the levels as dummy variables. Depending on the number of categories, perform one of the following:
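As a minimal sketch of the dummy-variable idea before applying it to the real columns (the column name here is a toy, not one from the dataset): a multilevel categorical becomes one 0/1 indicator column per level.

```python
import pandas as pd

# Toy example with a hypothetical column name (not from the Arvato data):
toy = pd.DataFrame({'REGION_TYPE': ['urban', 'rural', 'suburban', 'urban']})

# One-hot encode the multilevel categorical into 0/1 dummy columns.
# pd.get_dummies orders the new columns alphabetically by level.
dummies = pd.get_dummies(toy['REGION_TYPE'], prefix='REGION_TYPE')
print(dummies.columns.tolist())
# ['REGION_TYPE_rural', 'REGION_TYPE_suburban', 'REGION_TYPE_urban']
```

Binary categoricals that are already numeric can simply be kept as-is; only non-numeric ones (like OST_WEST_KZ below) need re-encoding.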
categorical_vars = feat_info[feat_info['type']=='categorical']
len(categorical_vars)
list(categorical_vars['attribute'])
# Assess categorical variables: which are binary, which are multi-level, and
# which one needs to be re-encoded?
"""
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.isnull.html
#https://stackoverflow.com/questions/36198118/np-isnan-on-arrays-of-dtype-object
"""
categorical_vars = feat_info[feat_info['type']=='categorical']
len(categorical_vars)
categorical_vars = list(categorical_vars['attribute'])
binary = []
multilevel = []
for variable in categorical_vars:
    print(variable)
    print()
    # Has this categorical column been removed from the cleaned dataset
    # (because it had too much missing data)?
    if variable in list(outliers_df['Column']):
        print("This column has been removed from azdias_cleaned_missing_few as having too much missing data.")
    else:
        # this list of levels includes NaN
        unique_levels = list(azdias_cleaned_missing_few[variable].unique())
        # number of unique levels, excluding NaN
        number_of_unique_levels = len(unique_levels) - sum(pd.isnull(unique_levels))
        print("Number of Unique levels: ", number_of_unique_levels)
        print()
        # is it multilevel or binary?
        if number_of_unique_levels > 2:
            print("Multilevel Categorical")
            multilevel.append(variable)
        else:
            print("Binary Categorical")
            binary.append(variable)
        print()
        print("Unique Levels: ", unique_levels)
        print()
    print("###################################################################")
print("Binary Categoricals: ")
print(binary)
print()
print("Multilevel Categoricals: ")
print(multilevel)
This one needs to be recoded:
OST_WEST_KZ
Number of Unique levels: 2
Binary
Unique Levels: [nan, 'W', 'O']
azdias_cleaned_missing_few.head()
# Re-encode categorical variable(s) to be kept in the analysis.
# print(azdias_cleaned_missing_few['OST_WEST_KZ'].head(100))
azdias_cleaned_missing_few['OST_WEST_KZ'].tail(100)
print(azdias_cleaned.shape)
print(azdias_cleaned_missing_few.shape)
print(azdias_cleaned.OST_WEST_KZ.value_counts())
bar_graph_for_each_column(azdias_cleaned,column_list=['OST_WEST_KZ'])
# this gives me the same result:
# print(azdias_cleaned_missing_few.OST_WEST_KZ.value_counts())
# bar_graph_for_each_column(azdias_cleaned_missing_few,column_list=['OST_WEST_KZ'])
"""Deep Learning. Lesson 1 Intro to Neural Networks. 36 Notebook: Analyzing Student Data"""
# Make dummy variables for OST_WEST_KZ
azdias_cleaned_encoded = pd.concat([azdias_cleaned_missing_few, pd.get_dummies(azdias_cleaned_missing_few['OST_WEST_KZ'], prefix='OST_WEST_KZ')], axis=1)
azdias_cleaned_encoded.shape
azdias_cleaned_encoded.head()
azdias_cleaned_encoded[['OST_WEST_KZ', 'OST_WEST_KZ_O', 'OST_WEST_KZ_W']].tail(100)
pd.crosstab(azdias_cleaned_encoded.OST_WEST_KZ, azdias_cleaned_encoded.OST_WEST_KZ_O)
pd.crosstab(azdias_cleaned_encoded.OST_WEST_KZ, azdias_cleaned_encoded.OST_WEST_KZ_W)
pd.crosstab(azdias_cleaned_encoded.OST_WEST_KZ_O, azdias_cleaned_encoded.OST_WEST_KZ_W)
# Drop the previous OST_WEST_KZ column
print(azdias_cleaned_encoded.shape)
# We don't actually need both dummy variables: the second one adds no information. We can drop OST_WEST_KZ_W and keep OST_WEST_KZ_O.
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['OST_WEST_KZ_W','OST_WEST_KZ'], axis = 1)
print(azdias_cleaned_encoded.shape)
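As an aside, the manual drop of the redundant dummy above can be folded into the encoding step: `pd.get_dummies` accepts `drop_first=True`, which removes the first level automatically. A minimal sketch on toy data (not what was run above):

```python
import pandas as pd

s = pd.Series(['W', 'O', 'W', 'O'], name='OST_WEST_KZ')
# drop_first=True drops the first level (alphabetically 'O' here),
# since the second dummy carries no extra information
d = pd.get_dummies(s, prefix='OST_WEST_KZ', drop_first=True)
print(d.columns.tolist())  # ['OST_WEST_KZ_W']
```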
print(len(multilevel))
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(multilevel, axis = 1)
print(azdias_cleaned_encoded.shape)
(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding categorical features. Which ones did you keep, which did you drop, and what engineering steps did you perform?)
Categorical variables
Kept untouched
Binary Categoricals: ['ANREDE_KZ', 'GREEN_AVANTGARDE', 'SOHO_KZ', 'VERS_TYP']
Recoded as dummy variable
Binary Categorical: 'OST_WEST_KZ'
Dropped from the analysis, both to keep things straightforward and to avoid overfitting due to having too many independent variables:
Multilevel Categoricals: ['CJT_GESAMTTYP', 'FINANZTYP', 'GFK_URLAUBERTYP', 'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_STATUS_FEIN', 'LP_STATUS_GROB', 'NATIONALITAET_KZ', 'SHOPPER_TYP', 'ZABEOTYP', 'GEBAEUDETYP', 'CAMEO_DEUG_2015', 'CAMEO_DEU_2015']
When I was working on the very last part of the project, the description of the customer clusters, I regretted dropping all of these variables. I should have kept at least a couple of them, e.g. those indicating shopper type.
There are a handful of features that are marked as "mixed" in the feature summary that require special treatment in order to be included in the analysis. There are two in particular that deserve attention; the handling of the rest are up to your own choices:
Be sure to check Data_Dictionary.md for the details needed to finish these tasks.
# How many features are there of each data type?
feat_info['type'].value_counts()
azdias_cleaned_encoded["PRAEGENDE_JUGENDJAHRE"].value_counts()
def recode_column_according_2_my_dict(df, oldvar, newvar, my_dict):
    """Recode a column according to the mapping specified in my_dict."""
    for item in my_dict:
        condition = df[oldvar].isin(my_dict[item])
        df[newvar] = np.where(condition == True, item, df[newvar])
    # https://stackoverflow.com/questions/33271098/python-get-a-frequency-count-based-on-two-columns-variables-in-pandas-datafra
    return pd.crosstab(df[oldvar], df[newvar])
#https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column
# initialize column
azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_DECADE'] = 0
# here's how the column will be recoded
my_dict = {
40:[1,2],
50:[3,4],
60:[5,6,7],
70:[8,9],
80:[10,11,12,13],
90:[14,15]
}
recode_column_according_2_my_dict(azdias_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE", "PRAEGENDE_JUGENDJAHRE_DECADE", my_dict)
azdias_cleaned_encoded[["PRAEGENDE_JUGENDJAHRE","PRAEGENDE_JUGENDJAHRE_DECADE"]].head(100)
azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'] = ""
my_dict ={
"AVANTGARDE": [2, 4, 6, 7, 9, 11, 13, 15],
"MAINSTREAM": [1, 3, 5, 8, 10, 12, 14]
}
recode_column_according_2_my_dict(azdias_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE", "PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP", my_dict)
azdias_cleaned_encoded[["PRAEGENDE_JUGENDJAHRE","PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP"]].head(100)
"""Deep Learning. Lesson 1 Intro to Neural Networks. 36 Notebook: Analyzing Student Data"""
# Make dummy variables for PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP
azdias_cleaned_encoded = pd.concat([azdias_cleaned_encoded, pd.get_dummies(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], prefix='PRAEGENDE_JUGENDJAHRE_MOVEMENT')], axis=1)
pd.crosstab(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE'])
pd.crosstab(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM'])
pd.crosstab(azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE'], azdias_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM'])
azdias_cleaned_encoded["PRAEGENDE_JUGENDJAHRE"].isna().sum()
# Drop the previous PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP column.
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'], axis = 1)
print(azdias_cleaned_encoded.shape)
# Drop the previous PRAEGENDE_JUGENDJAHRE
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('PRAEGENDE_JUGENDJAHRE', axis = 1)
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded.head()
azdias_cleaned_encoded.PRAEGENDE_JUGENDJAHRE_MOVEMENT_.describe()
azdias_cleaned_encoded.PRAEGENDE_JUGENDJAHRE_MOVEMENT_.sum()
# Drop the PRAEGENDE_JUGENDJAHRE_MOVEMENT_ column. It just counts the NaNs in PRAEGENDE_JUGENDJAHRE
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('PRAEGENDE_JUGENDJAHRE_MOVEMENT_', axis = 1)
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded.head()
azdias_cleaned_encoded["CAMEO_INTL_2015"].value_counts()
azdias_cleaned_encoded.info()
"""
https://stackoverflow.com/questions/41271299/how-can-i-get-the-first-two-digits-of-a-number
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html
"""
azdias_cleaned_encoded["CAMEO_INTL_2015_WEALTH"] =azdias_cleaned_encoded["CAMEO_INTL_2015"].str.slice(0,1)
azdias_cleaned_encoded["CAMEO_INTL_2015_LIFE_STAGE_TYP"] =azdias_cleaned_encoded["CAMEO_INTL_2015"].str.slice(1,2)
azdias_cleaned_encoded[["CAMEO_INTL_2015", "CAMEO_INTL_2015_WEALTH","CAMEO_INTL_2015_LIFE_STAGE_TYP"]].head(10)
pd.crosstab(azdias_cleaned_encoded["CAMEO_INTL_2015"], azdias_cleaned_encoded["CAMEO_INTL_2015_WEALTH"])
pd.crosstab(azdias_cleaned_encoded["CAMEO_INTL_2015"], azdias_cleaned_encoded["CAMEO_INTL_2015_LIFE_STAGE_TYP"])
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('CAMEO_INTL_2015', axis = 1)
print(azdias_cleaned_encoded.shape)
mixed_vars = feat_info[feat_info['type']=='mixed']
len(mixed_vars)
list(mixed_vars['attribute'])
# PRAEGENDE_JUGENDJAHRE and CAMEO_INTL_2015 have already been taken care of
# ['LP_LEBENSPHASE_FEIN',
# 'LP_LEBENSPHASE_GROB',
# 'PRAEGENDE_JUGENDJAHRE', Converted by PRAEGENDE_JUGENDJAHRE_DECADE and PRAEGENDE_JUGENDJAHRE_MOVEMENT
# 'WOHNLAGE',
# 'CAMEO_INTL_2015', Converted to CAMEO_INTL_2015_WEALTH and CAMEO_INTL_2015_LIFE_STAGE_TYP
# 'KBA05_BAUMAX',
# 'PLZ8_BAUMAX']
# 8.6. PLZ8_BAUMAX
# Most common building type within the PLZ8 region
#
# -1: unknown
# 0: unknown
# 1: mainly 1-2 family homes
# 2: mainly 3-5 family homes
# 3: mainly 6-10 family homes
# 4: mainly 10+ family homes
# 5: mainly business buildings
azdias_cleaned_encoded["PLZ8_BAUMAX"].value_counts()
azdias_cleaned_encoded["PLZ8_BAUMAX"].isna().sum()
my_dict ={
"FAMILY": [1,2,3,4],
"BUSINESS": [5]
}
azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'] = ""
recode_column_according_2_my_dict(azdias_cleaned_encoded, "PLZ8_BAUMAX", "PLZ8_BAUMAX_BLDNG_TYPE_TEMP", my_dict)
azdias_cleaned_encoded.head()
azdias_cleaned_encoded = pd.concat([azdias_cleaned_encoded, pd.get_dummies(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'], prefix='PLZ8_BAUMAX_BLDNG_TYPE')], axis=1)
pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'], azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS'])
pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'], azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_FAMILY'])
pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS'], azdias_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_FAMILY'])
azdias_cleaned_encoded.head()
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['PLZ8_BAUMAX_BLDNG_TYPE_TEMP','PLZ8_BAUMAX_BLDNG_TYPE_'], axis = 1)
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded.head()
pd.crosstab(azdias_cleaned_encoded.PLZ8_BAUMAX, azdias_cleaned_encoded.PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS)
pd.crosstab(azdias_cleaned_encoded.PLZ8_BAUMAX, azdias_cleaned_encoded.PLZ8_BAUMAX_BLDNG_TYPE_FAMILY)
# how many family homes are there?
condition = azdias_cleaned_encoded['PLZ8_BAUMAX'] == 5
azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'] = azdias_cleaned_encoded['PLZ8_BAUMAX']
azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'] = np.where(condition==True, 0, azdias_cleaned_encoded['PLZ8_BAUMAX'])
pd.crosstab(azdias_cleaned_encoded['PLZ8_BAUMAX'], azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'] )
azdias_cleaned_encoded['PLZ8_BAUMAX'].value_counts()
azdias_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'].value_counts()
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop('PLZ8_BAUMAX', axis = 1)
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded.head()
pd.crosstab(azdias_cleaned_encoded.PLZ8_BAUMAX_BLDNG_TYPE_FAMILY, azdias_cleaned_encoded.PLZ8_BAUMAX_FAMILY_HOMES)
# 3.7. WOHNLAGE
# Neighborhood quality (or rural flag)
#
# -1: unknown
# 0: no score calculated
# 1: very good neighborhood
# 2: good neighborhood
# 3: average neighborhood
# 4: poor neighborhood
# 5: very poor neighborhood
# 7: rural neighborhood
# 8: new building in rural neighborhood
# note: 0 means no score calculated. In the feature info file, this should probably have been included in the codes for missing/unknown values (?)
#explore
print(azdias_cleaned_encoded["WOHNLAGE"].value_counts())
print()
#explore nan
print(azdias_cleaned_encoded["WOHNLAGE"].isna().sum())
condition = azdias_cleaned_encoded["WOHNLAGE"].isin([7,8])
azdias_cleaned_encoded["WOHNLAGE_RURAL_FLAG"] = np.where(condition==True, 1,0)
pd.crosstab(azdias_cleaned_encoded["WOHNLAGE"], azdias_cleaned_encoded["WOHNLAGE_RURAL_FLAG"] )
condition = azdias_cleaned_encoded["WOHNLAGE"].isin([7,8])
azdias_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"] = np.where(condition==True, 0,azdias_cleaned_encoded["WOHNLAGE"])
pd.crosstab(azdias_cleaned_encoded["WOHNLAGE"], azdias_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"])
azdias_cleaned_encoded["WOHNLAGE"].value_counts()
azdias_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"].value_counts()
# drop the original column
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['WOHNLAGE'], axis = 1)
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded.head()
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['LP_LEBENSPHASE_FEIN','LP_LEBENSPHASE_GROB'], axis = 1)
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded.head()
(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding mixed-value features. Which ones did you keep, which did you drop, and what engineering steps did you perform?)
LP_LEBENSPHASE_FEIN. There are too many dimensions (age, income, home ownership, family status, parenthood) and levels here, and some of these dimensions are already present in other variables, so to keep things simple I dropped it.
LP_LEBENSPHASE_GROB. The family component is already present in other variables, such as LP_FAMILIE_FEIN. Income has only two levels here: 1) low and average, and 2) high, and there are other variables that indicate income, such as LP_STATUS_FEIN, LP_STATUS_GROB, and HH_EINKOMMEN_SCORE. Dropped this column.
PRAEGENDE_JUGENDJAHRE. Converted to PRAEGENDE_JUGENDJAHRE_DECADE and PRAEGENDE_JUGENDJAHRE_MOVEMENT_AVANTGARDE (dummy variable) and PRAEGENDE_JUGENDJAHRE_MOVEMENT_MAINSTREAM (dummy variable).
WOHNLAGE. Converted to WOHNLAGE_RURAL_FLAG (dummy variable) and WOHNLAGE_CITY_NEIGHBOURHOOD
CAMEO_INTL_2015. Converted to CAMEO_INTL_2015_WEALTH and CAMEO_INTL_2015_LIFE_STAGE_TYP.
KBA05_BAUMAX This one was removed earlier as an outlier column with too much missing data.
PLZ8_BAUMAX Converted to PLZ8_BAUMAX_BLDNG_TYPE_BUSINESS (dummy variable), PLZ8_BAUMAX_BLDNG_TYPE_FAMILY (dummy variable), PLZ8_BAUMAX_FAMILY_HOMES.
In order to finish this step up, you need to make sure that your data frame now only has the columns that you want to keep. To summarize, the dataframe should consist of the following:
Make sure that for any new columns that you have engineered, that you've excluded the original columns from the final dataset. Otherwise, their values will interfere with the analysis later on the project. For example, you should not keep "PRAEGENDE_JUGENDJAHRE", since its values won't be useful for the algorithm: only the values derived from it in the engineered features you created should be retained. As a reminder, your data should only be from the subset with few or no missing values.
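As a quick sanity check of this requirement, a small helper like the following can confirm that no original column survived into the final frame (a sketch; `check_originals_dropped` is a hypothetical name, demonstrated here on a toy frame rather than the real `azdias_cleaned_encoded`):

```python
import pandas as pd

def check_originals_dropped(df, originals):
    """Return any original (pre-engineering) columns still present in df."""
    return [c for c in originals if c in df.columns]

# Toy frame standing in for the cleaned data: only engineered columns remain
toy = pd.DataFrame({'OST_WEST_KZ_O': [1], 'PRAEGENDE_JUGENDJAHRE_DECADE': [90]})
print(check_originals_dropped(toy, ['OST_WEST_KZ', 'PRAEGENDE_JUGENDJAHRE']))  # []
```

An empty list means all of the listed source columns were successfully excluded.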
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded = azdias_cleaned_encoded.drop(['NAN_count','many_missing_values_in_row'], axis = 1)
print(azdias_cleaned_encoded.shape)
azdias_cleaned_encoded.info()
for item in list(mixed_vars['attribute']):
    print(item)
    try:
        print(azdias_cleaned_encoded[item])
    except KeyError:
        print('this var has been deleted and/or re-encoded')
    print()
for item in multilevel:
    print(item)
    try:
        print(azdias_cleaned_encoded[item])
    except KeyError:
        print('this var has been deleted and/or re-encoded')
    print()
for item in ['OST_WEST_KZ']:
    print(item)
    try:
        print(azdias_cleaned_encoded[item])
    except KeyError:
        print('this var has been deleted and/or re-encoded')
    print()
for item in ['PRAEGENDE_JUGENDJAHRE', 'CAMEO_INTL_2015']:
    print(item)
    try:
        print(azdias_cleaned_encoded[item])
    except KeyError:
        print('this var has been deleted and/or re-encoded')
    print()
Even though you've finished cleaning up the general population demographics data, it's important to look ahead to the future and realize that you'll need to perform the same cleaning steps on the customer demographics data. In this substep, complete the function below to execute the main feature selection, encoding, and re-engineering steps you performed above. Then, when it comes to looking at the customer data in Step 3, you can just run this function on that DataFrame to get the trimmed dataset in a single step.
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics data.

    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame

    NOTE: This function does NOT build a cleaning procedure from scratch.
    It reuses the data transformation functions from the previous sections,
    which avoids code repetition and keeps this function readable and easy
    to maintain.
    """
    ###########################################################################
    # Convert missing value codes into NaNs.
    df_cleaned = replace_coded_as_missing_unknown_with_NANs(df)
    # Summarize NaNs per column.
    df_cleaned_NA_report = how_many_NA(df_cleaned)
    # Remove outlier columns: the following columns have more than 1/3 of
    # their data missing.
    outliers_df = df_cleaned_NA_report[df_cleaned_NA_report['missing_NA_percent'] > 0.333]
    for outlier_column in outliers_df['Column'].tolist():
        del df_cleaned[outlier_column]
    ###########################################################################
    # Add a NaN count for each row (takes 3-5 minutes to run), then split off
    # the subset of rows with few missing values.
    count_NAN_in_each_ROW(df_cleaned)
    df_cleaned_missing_few, df_cleaned_missing_many, threshold = divide_into_subsets(df_cleaned, 30)
    #########################################################
    # Make dummy variables for OST_WEST_KZ.
    df_cleaned_encoded = pd.concat(
        [df_cleaned_missing_few,
         pd.get_dummies(df_cleaned_missing_few['OST_WEST_KZ'], prefix='OST_WEST_KZ')],
        axis=1)
    #########################################################
    # Re-encode PRAEGENDE_JUGENDJAHRE into a decade column...
    df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_DECADE'] = 0
    my_dict = {
        40: [1, 2],
        50: [3, 4],
        60: [5, 6, 7],
        70: [8, 9],
        80: [10, 11, 12, 13],
        90: [14, 15]
    }
    recode_column_according_2_my_dict(df_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE",
                                      "PRAEGENDE_JUGENDJAHRE_DECADE", my_dict)
    # ... and a movement column.
    df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'] = ""
    my_dict = {
        "AVANTGARDE": [2, 4, 6, 7, 9, 11, 13, 15],
        "MAINSTREAM": [1, 3, 5, 8, 10, 12, 14]
    }
    recode_column_according_2_my_dict(df_cleaned_encoded, "PRAEGENDE_JUGENDJAHRE",
                                      "PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP", my_dict)
    # Make dummy variables for PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP.
    df_cleaned_encoded = pd.concat(
        [df_cleaned_encoded,
         pd.get_dummies(df_cleaned_encoded['PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP'],
                        prefix='PRAEGENDE_JUGENDJAHRE_MOVEMENT')],
        axis=1)
    # PRAEGENDE_JUGENDJAHRE_MOVEMENT_ is produced by missing values: it just
    # counts the NaNs in the original PRAEGENDE_JUGENDJAHRE column. It is
    # dropped below, along with the temporary _TEMP column.
    #########################################################
    # Split CAMEO_INTL_2015 into its two digits.
    df_cleaned_encoded["CAMEO_INTL_2015_WEALTH"] = df_cleaned_encoded["CAMEO_INTL_2015"].str.slice(0, 1)
    df_cleaned_encoded["CAMEO_INTL_2015_LIFE_STAGE_TYP"] = df_cleaned_encoded["CAMEO_INTL_2015"].str.slice(1, 2)
    #########################################################
    # Re-encode PLZ8_BAUMAX into building-type dummies and a family-homes count.
    df_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'] = ""
    my_dict = {
        "FAMILY": [1, 2, 3, 4],
        "BUSINESS": [5]
    }
    recode_column_according_2_my_dict(df_cleaned_encoded, "PLZ8_BAUMAX",
                                      "PLZ8_BAUMAX_BLDNG_TYPE_TEMP", my_dict)
    df_cleaned_encoded = pd.concat(
        [df_cleaned_encoded,
         pd.get_dummies(df_cleaned_encoded['PLZ8_BAUMAX_BLDNG_TYPE_TEMP'],
                        prefix='PLZ8_BAUMAX_BLDNG_TYPE')],
        axis=1)
    # How many family homes are there? Zero out the business level.
    condition = df_cleaned_encoded['PLZ8_BAUMAX'] == 5
    df_cleaned_encoded['PLZ8_BAUMAX_FAMILY_HOMES'] = np.where(condition, 0, df_cleaned_encoded['PLZ8_BAUMAX'])
    #########################################################
    # Re-encode WOHNLAGE into a rural flag and a city-neighborhood score.
    condition = df_cleaned_encoded["WOHNLAGE"].isin([7, 8])
    df_cleaned_encoded["WOHNLAGE_RURAL_FLAG"] = np.where(condition, 1, 0)
    df_cleaned_encoded["WOHNLAGE_CITY_NEIGHBOURHOOD"] = np.where(condition, 0, df_cleaned_encoded["WOHNLAGE"])
    #########################################################
    # Drop the multilevel categoricals (the `multilevel` list comes from the
    # categorical-variable analysis above), the original and temporary columns,
    # and the row-level NaN bookkeeping columns.
    df_cleaned_encoded = df_cleaned_encoded.drop(multilevel, axis=1)
    drop_list = ['OST_WEST_KZ', 'OST_WEST_KZ_W', 'PRAEGENDE_JUGENDJAHRE',
                 'PRAEGENDE_JUGENDJAHRE_MOVEMENT_TEMP', 'PRAEGENDE_JUGENDJAHRE_MOVEMENT_',
                 'CAMEO_INTL_2015', 'PLZ8_BAUMAX_BLDNG_TYPE_TEMP', 'PLZ8_BAUMAX_BLDNG_TYPE_',
                 'PLZ8_BAUMAX', 'WOHNLAGE', 'LP_LEBENSPHASE_FEIN', 'LP_LEBENSPHASE_GROB',
                 'NAN_count', 'many_missing_values_in_row']
    df_cleaned_encoded = df_cleaned_encoded.drop(drop_list, axis=1)
    # Return the cleaned dataframe.
    return df_cleaned_encoded
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION = clean_data(azdias)
# I want to make sure my cleaning function returns the same results as the step-by-step cleaning above.
# If it says True below, then we're good
azdias_cleaned_encoded.equals(azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION)
azdias_cleaned_encoded.head()
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION.head()
azdias_cleaned_encoded.shape
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION.shape
azdias_cleaned_encoded.info()
azdias_cleaned_encoded_TEST_BIGASS_CLEANING_FUNCTION.info()
Before we apply dimensionality reduction techniques to the data, we need to perform feature scaling so that the principal component vectors are not influenced by the natural differences in scale for features. Starting from this part of the project, you'll want to keep an eye on the API reference page for sklearn to help you navigate to all of the classes and functions that you'll need. In this substep, you'll need to check the following:
Use the .fit_transform() method to both fit a procedure to the data and apply the transformation to the data at the same time. Don't forget to keep the fitted sklearn objects handy, since you'll be applying them to the customer demographics data towards the end of the project.
# how many columns have missing values
azdias_cleaned_encoded.isnull().any().sum()
azdias_cleaned_encoded.isnull().sum()
# this is a test of using StandardScaler on a dataset with missing values, ignoring NAN
#Source: https://stackoverflow.com/questions/50897516/assigning-nan-to-1-after-performing-standardscaler
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
# The standard score of a sample x is calculated as:
# z = (x - u) / s
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
#Create a dataframe
df = pd.DataFrame({'col1': [0, np.nan, 2, 3, np.nan, 4, 5, np.nan, 6, 7, np.nan]})
print(df)
#Get the index of null values
null_values = df['col1'].isnull()
df_copy = df.copy()
#Perform standard scalar on only non-NaN values
df_copy.loc[~null_values, ['col1']] = StandardScaler().fit_transform(df.loc[~null_values, ['col1']])
print(df_copy)
#https://medium.com/@seb231/principal-component-analysis-with-missing-data-9e28f440ce93
# https://scikit-learn.org/dev/modules/impute.html
# https://stackoverflow.com/questions/29420737/pca-with-missing-values-in-python
# my understanding is that even if I were to leave the missing values in, I would run into issues later on while conducting PCA
azdias_cleaned_encoded.head()
azdias_cleaned_encoded_NA_report = how_many_NA(azdias_cleaned_encoded)
azdias_cleaned_encoded_NA_report
# columns with missing data
azdias_cleaned_encoded_NA_report[azdias_cleaned_encoded_NA_report['missing_NA'] != 0].sort_values(by='missing_NA_percent',ascending=False)
# list of columns with missing data
list_of_columns_with_missing_data = list(azdias_cleaned_encoded_NA_report[azdias_cleaned_encoded_NA_report['missing_NA'] != 0]['Column'])
list_of_columns_with_missing_data
# The purpose of the code below is to understand how best to impute missing values:
# bar graphs for the columns with missing data.
# Look and see which columns are good candidates for imputation with the most frequent value (or the average?).
bar_graph_for_each_column(azdias_cleaned_encoded, list_of_columns_with_missing_data)
For ANZ_HH_TITEL, 0 should perhaps have been replaced by NaN, but the feature info doc didn't say anything about 0s here (?)
azdias_cleaned_encoded.head()
# Imputer produces an array, not a dataframe, so I'll lose my column names and need to rebuild the df.
# Save the column names first:
columns_list = list(azdias_cleaned_encoded.columns)
print(type(columns_list))
columns_list
"""
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
https://stackoverflow.com/questions/25017626/predicting-missing-values-with-scikit-learns-imputer-module
https://stackoverflow.com/questions/52384806/imputer-on-some-columns-in-a-dataframe
"""
# Note: in newer scikit-learn versions, sklearn.preprocessing.Imputer has been replaced by sklearn.impute.SimpleImputer.
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
azdias_cleaned_encoded_imputed = imputer.fit_transform(azdias_cleaned_encoded)
print(type(azdias_cleaned_encoded_imputed))
azdias_cleaned_encoded_imputed = pd.DataFrame(azdias_cleaned_encoded_imputed,columns=columns_list)
print(type(azdias_cleaned_encoded_imputed))
azdias_cleaned_encoded_imputed.head()
# number of cols with nan
azdias_cleaned_encoded_imputed.isnull().any().sum()
# we now should have no NANs
azdias_cleaned_encoded_imputed.isnull().sum()
# checking how mode imputation worked
print(azdias_cleaned_encoded['KKK'].isna().sum())
print(azdias_cleaned_encoded['KKK'].value_counts())
print(azdias_cleaned_encoded_imputed['KKK'].isna().sum())
print(azdias_cleaned_encoded_imputed['KKK'].value_counts())
# KKK
64910 + 273024
len(list_of_columns_with_missing_data)
# checking how mode imputation worked
# NAN count + mode frequency in the initial dataset = mode frequency in IMPUTED dataset
x = 10 # give a number from 0 to 33. This will indicate which col with missing value you're checking
test = list_of_columns_with_missing_data[x]
print("checking column:")
print(test)
print()
print('Before imputation:')
print('NAN', azdias_cleaned_encoded[test].isna().sum())
print(azdias_cleaned_encoded[test].value_counts())
print()
print('After imputation:')
print('NAN', azdias_cleaned_encoded_imputed[test].isna().sum())
print(azdias_cleaned_encoded_imputed[test].value_counts())
print()
if azdias_cleaned_encoded[test].isna().sum() + max(azdias_cleaned_encoded[test].value_counts()) == max(azdias_cleaned_encoded_imputed[test].value_counts()):
    print('All good :)')
    print(azdias_cleaned_encoded[test].isna().sum(), '+', max(azdias_cleaned_encoded[test].value_counts()), '==', max(azdias_cleaned_encoded_imputed[test].value_counts()))
else:
    print('Something is wrong :(')
# PLZ8_BAUMAX_FAMILY_HOMES
23361 + 499550
# ALTERSKATEGORIE_GROB
2803 + 310466
"""
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler
https://stackoverflow.com/questions/24645153/pandas-dataframe-columns-scaling-with-sklearn
"""
scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(azdias_cleaned_encoded_imputed)
print(type(azdias_scaled))
azdias_scaled = pd.DataFrame(azdias_scaled, columns=columns_list)
print(type(azdias_scaled))
azdias_scaled.head()
azdias_cleaned_encoded_imputed.describe().transpose()
# mean close to zero. Stdev is close to 1
azdias_scaled.describe().transpose()
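The reminder above about keeping the fitted sklearn objects can be sketched as follows (toy arrays stand in for the general-population and customer data; variable names are hypothetical): fit once on the general population, then only transform the customer data with the same object.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

general = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # stand-in for azdias
customers = np.array([[2.0, 20.0]])                          # stand-in for customer data

scaler = StandardScaler()
general_scaled = scaler.fit_transform(general)  # fit on the general population
customers_scaled = scaler.transform(customers)  # reuse the SAME fitted scaler
print(customers_scaled)  # [[0. 0.]] -- this row equals the general-population means
```

Calling `fit_transform` again on the customer data would silently re-estimate the means and standard deviations, making the two datasets incomparable.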
(Double-click this cell and replace this text with your own text, reporting your decisions regarding feature scaling.)
I imputed missing values with each column's most frequent value (mode), because most columns have a clearly pronounced mode: a high frequency relative to the other levels, with no close second most frequent value. For such columns the mode appears to be a good imputation choice. Not all columns fit that pattern, and one could argue that the median would be better for some of them (e.g., KBA13_ANZAHL_PKW), but to keep things simple I used the mode throughout.
I scaled the data with StandardScaler.
On your scaled data, you are now ready to apply dimensionality reduction techniques.
plot() function. Based on what you find, select a value for the number of transformed features you'll retain for the clustering part of the project.
# Apply PCA to the data.
def do_pca(data_scaled, n_components=None):
    '''
    Source: Udacity Nanodegree Unsupervised Learning 4 Dimensionality Reduction and PCA
    Assumes data is already scaled.
    INPUT: n_components - int - the number of principal components to create
           data_scaled - the data you would like to transform
    OUTPUT: pca - the pca object created after fitting the data
            X_pca - the transformed X matrix with new number of components
    '''
    pca = PCA(n_components, random_state=42)
    X_pca = pca.fit_transform(data_scaled)
    return pca, X_pca
pca, X_pca = do_pca(azdias_scaled)
print(azdias_scaled.shape)
print(X_pca.shape)
type(X_pca)
X_pca
pca
# np.arange(0,1.1, 0.1)
# turn off scientific notation
# Credit:
# https://twitter.com/vboykis
pd.set_option('display.float_format', lambda x: '%.3f' % x)
# look at each PCA and how much variance it explains
num_components=len(pca.explained_variance_ratio_)
ind = np.arange(num_components)+1
vals = pca.explained_variance_ratio_
cumvals = np.cumsum(vals)
test_pca_df = pd.DataFrame({'pca': ind, 'vals': vals, 'cumvals': cumvals})
test_pca_df
def look_at_pca_and_variance(x):
    y = min(np.where(np.cumsum(pca.explained_variance_ratio_) >= x)[0] + 1)
    print('To explain', x, 'of variance, we need', y, 'principal components')
    return y
look_at_pca_and_variance(0.6)
look_at_pca_and_variance(0.7)
look_at_pca_and_variance(0.8)
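The same threshold lookup can be done in a single vectorized step with `np.searchsorted` on the cumulative sum. The explained-variance ratios below are invented for illustration, not the fitted PCA's output:

```python
import numpy as np

# hypothetical explained-variance ratios for 8 components
evr = np.array([0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.05])
cum = np.cumsum(evr)

def n_components_for(threshold):
    """Smallest number of components whose cumulative explained variance >= threshold."""
    return int(np.searchsorted(cum, threshold) + 1)

print(n_components_for(0.6))  # -> 3
print(n_components_for(0.8))  # -> 5
```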
# num_components=len(pca.explained_variance_ratio_)
# ind = np.arange(num_components)+1
# ind
# np.arange(0,71, 5)
# Investigate the variance accounted for by each principal component.
def scree_plot(pca):
    '''
    Source: Udacity Nanodegree Unsupervised Learning 4 Dimensionality Reduction and PCA
    https://stackoverflow.com/questions/12608788/changing-the-tick-frequency-on-x-or-y-axis-in-matplotlib
    Creates a scree plot associated with the principal components
    INPUT: pca - the result of instantiation of PCA in scikit-learn
    OUTPUT: None
    '''
    num_components = len(pca.explained_variance_ratio_)
    ind = np.arange(num_components) + 1
    vals = pca.explained_variance_ratio_
    plt.figure(figsize=(15, 10))
    ax = plt.subplot(111)
    cumvals = np.cumsum(vals)
    ax.bar(ind, vals)
    ax.plot(ind, cumvals)
    for i in range(num_components):
        ax.annotate(r"%s%%" % ((str(vals[i] * 100)[:4])), (ind[i] + 0.2, vals[i]), va="bottom", ha="center", fontsize=12)
    ax.xaxis.set_tick_params(width=0)
    ax.yaxis.set_tick_params(width=2, length=12)
    plt.yticks(np.arange(0, 1.1, 0.1))  # I added this to see every 10% on the y axis
    plt.xticks(np.arange(0, 71, 5))
    # draw guide lines to see how many components each variance threshold requires
    for threshold in (0.6, 0.7, 0.8):
        n_needed = look_at_pca_and_variance(threshold)
        plt.hlines(y=threshold, xmin=0, xmax=n_needed, color='red', linestyles='dashed', zorder=1)
        plt.vlines(x=n_needed, ymin=0, ymax=threshold, color='red', linestyles='dashed', zorder=2)
    ax.set_xlabel("Principal Component")
    ax.set_ylabel("Variance Explained (%)")
    plt.title('Explained Variance Per Principal Component')
scree_plot(pca)
# Percentage of variance explained
pca.explained_variance_ratio_.sum()
# Re-apply PCA to the data while selecting for number of components to retain.
pca, X_pca = do_pca(azdias_scaled,n_components=23)
print(azdias_scaled.shape)
print(X_pca.shape)
scree_plot(pca)
# Percentage of variance explained
pca.explained_variance_ratio_.sum()
type(azdias_scaled)
(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding dimensionality reduction. How many principal components / transformed features are you retaining for the next step of the analysis?)
I would take 23 components because they would let me explain almost 80% of variability in the data.
https://stackoverflow.com/questions/12067446/how-many-principal-components-to-take
https://blogs.sas.com/content/iml/2017/08/02/retain-principal-components.html
https://www.researchgate.net/post/How_many_components_can_I_retrieve_in_principal_component_analysis
Now that we have our transformed principal components, it's a nice idea to check out the weight of each variable on the first few components to see if they can be interpreted in some fashion.
As a reminder, each principal component is a unit vector that points in the direction of highest variance (after accounting for the variance captured by earlier principal components). The further a weight is from zero, the more the principal component points in the direction of the corresponding feature. If two features have large weights of the same sign (both positive or both negative), then increases in one can be expected to be associated with increases in the other. By contrast, features with weights of opposite signs can be expected to show a negative correlation: increases in one variable should be associated with decreases in the other.
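One way to put this into practice is to rank a component's features by absolute weight, so the strongest positive and negative contributors surface together. The weights below are hypothetical, not taken from the fitted PCA:

```python
import pandas as pd

# hypothetical feature weights for a single principal component
weights = pd.Series(
    {'FEAT_A': 0.52, 'FEAT_B': -0.48, 'FEAT_C': 0.05, 'FEAT_D': -0.41, 'FEAT_E': 0.44}
)

# order by absolute weight: large same-sign features move together along the
# component; large opposite-sign features suggest a negative correlation
top = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(top.head(3))
```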
# Map weights for the first principal component to corresponding feature names
# and then print the linked values, sorted by weight.
# HINT: Try defining a function here or in a new cell that you can reuse in the
# other cells.
def pca_results(full_dataset, pca, number):
    '''
    Source: Udacity Nanodegree Unsupervised Learning 4 Dimensionality Reduction and PCA
    https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot
    https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html
    Create a DataFrame of the PCA results.
    Includes dimension feature weights and explained variance.
    Visualizes the PCA results.
    '''
    number_minus_1 = number - 1
    # Dimension indexing
    dimensions = ['Dimension {}'.format(i) for i in range(1, len(pca.components_) + 1)]
    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns=full_dataset.keys())
    components.index = dimensions
    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1)
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns=['Explained Variance'])
    variance_ratios.index = dimensions
    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize=(10, 15))
    # Plot the feature weights as a function of the components
    components.iloc[[number_minus_1]].plot(ax=ax, kind='bar', width=3)
    ax.set_ylabel("Feature Weights")
    # create the tick for the x axis
    ax.set_xticklabels([dimensions[number_minus_1]], rotation=0)
    ax.set_xlabel("Dataset Variables")
    # title
    ax.set_title("Feature Weights for Dimension " + str(number))
    # Display the explained variance ratio
    ax.text(-0.1, ax.get_ylim()[1] + 0.05, s="Explained Variance\n %.4f" % (pca.explained_variance_ratio_[number_minus_1]))
    ax.legend(bbox_to_anchor=(1.1, 1.05))  # move the legend outside the plot
    return pd.concat([variance_ratios, components], axis=1)
my_pca_results = pca_results(azdias_scaled, pca, 1)
# Map weights for the second principal component to corresponding feature names
# and then print the linked values, sorted by weight.
my_pca_results = pca_results(azdias_scaled, pca, 2)
# Map weights for the third principal component to corresponding feature names
# and then print the linked values, sorted by weight.
my_pca_results = pca_results(azdias_scaled, pca, 3)
def plot_pca(data, pca, n_compo):
    '''
    The visualizations above are too large to read easily; this one shows only the top variables.
    Adapted from:
    https://github.com/MiguelAMartinez/identify-customer-segments-Arvato/blob/master/Identify_Customer_Segments.ipynb
    Plot the features with the largest absolute weights for the given PCA component.
    '''
    compo = pd.DataFrame(pca.components_, columns=data.keys()).iloc[n_compo - 1]
    compo.sort_values(ascending=False, inplace=True)
    compo = pd.concat([compo.head(7), compo.tail(7)])
    fig, ax = plt.subplots(figsize=(10, 7))
    compo.plot(kind='barh', title='Component ' + str(n_compo))
    ax = plt.gca()
    ax.grid(linewidth=0.5, alpha=0.5)
    ax.set_axisbelow(True)
    plt.show()
plot_pca(azdias_scaled, pca,1)
# I wasn't sure if I should supply the scaled dataset or the non-scaled one. It doesn't matter because the results are equal:
# my_pca_results_1 = pca_results(azdias_cleaned_encoded_imputed, pca, 1)
# my_pca_results_2 = pca_results(azdias_scaled, pca, 1)
# my_pca_results_1.equals(my_pca_results_2)
# True
plot_pca(azdias_scaled, pca,2)
plot_pca(azdias_scaled, pca,3)
my_pca_results.head()
# temp_df = my_pca_results.head()
# temp_df.to_excel("feature_weights.xlsx")
my_pca_results
(Double-click this cell and replace this text with your own text, reporting your observations from detailed investigation of the first few principal components generated. Can we interpret positive and negative values from them in a meaningful way?)
Component 1. This one is hard to interpret because it correlates with many variables.
KBA05_ANTG3: high share of 6-10 family homes.
KBA05_ANTG4: high share of 10+ family homes. Do they live in a densely populated area?
CAMEO_INTL_2015_WEALTH: positive weight; a high number means poor.
FINANZ_SPARER (money saver): a high number means low affinity, so not savers.
FINANZ_HAUSBAUER (home owner): positive weight; a high number means low affinity, so not home owners.
MINIMALIST (low financial interest): a low number means high affinity, so they do have low financial interest, though I'm not sure what that means in this context.
Summary: relatively poor people who don't save and don't own a home; they live in a densely populated area; lower social status.
Component 2.
ALTERSKATEGORIE_GROB (age): older people.
LUST (sensual-minded): a high number means lowest affinity.
ERL (event-oriented): a high number means lowest affinity.
So: not sensual-minded, not event-oriented.
FINANZ_VORSORGER (be prepared): a high number means lowest affinity, so not financially prepared.
SEMIO_REL (religious): negative weight; a low number means high affinity, so religious.
FINANZ_SPARER (money saver): negative weight; a low number means high affinity, so savers.
FINANZ_ANLEGER (investor): negative weight; a low number means high affinity, so investors. This contradicts "not financially prepared".
PRAEGENDE_JUGENDJAHRE_DECADE: negative weight on the decade, indicating an earlier decade. Makes sense: these are older people.
PFLICHT (dutiful) and TRADV (traditionally-minded): a low score means high affinity.
Summary: older, traditional, dutiful, religious, conservative people; not event-oriented, not sensual-minded. They save and invest money but maybe still feel unprepared for retirement (this is just an assumption).
Component 3.
SEMIO_VERT (dreamful), SOZ (socially-minded) and FAM (family-minded) have positive weights, but on these scales a small number indicates high affinity and a large number low affinity, so the association is actually negative: not dreamful, socially-minded or family-minded.
ANREDE_KZ (gender, 1 = male, 2 = female) has a negative weight, so component 3 must be associated with males.
It also has negative weights on FINANZ_ANLEGER (investor), SEMIO_KAEM (combative attitude), SEMIO_KRIT (critical-minded) and SEMIO_DOM (dominant-minded); since those scales run from 1 (highest affinity) to 7 (lowest affinity), the association is actually positive.
Summary: male; not dreamful, socially-minded or family-minded; an investor with a combative attitude, critical-minded and dominant.
You've assessed and cleaned the demographics data, then scaled and transformed them. Now, it's time to see how the data clusters in the principal components space. In this substep, you will apply k-means clustering to the dataset and use the average within-cluster distances from each point to their assigned cluster's centroid to decide on a number of clusters to keep.
.score() method might be useful here, but note that in sklearn, scores tend to be defined so that larger is better. Try applying it to a small, toy dataset, or use an internet search to help your understanding.
# Use a sample to reduce computation time
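For reference, sklearn's `KMeans.score` returns the negative inertia (the sum of squared distances from each point to its nearest centroid), which is why the values are negative and "larger is better" means closer to zero. A numpy-only toy that reproduces that arithmetic:

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.0]])

# squared distance from every point to every centroid, then keep the nearest
d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
inertia = d2.min(axis=1).sum()
print(-inertia)  # analogue of model.score(points) -> -0.5
```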
# Borrowed from:
# https://github.com/MiguelAMartinez/identify-customer-segments-Arvato/blob/master/Identify_Customer_Segments.ipynb
# https://stackoverflow.com/questions/22994423/difference-between-np-random-seed-and-np-random-randomstate
np.random.seed(42)
# https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html
x = 0.25
X_pca_test_subset = X_pca[np.random.choice(X_pca.shape[0], int(X_pca.shape[0]*x), replace=False)]
print('Taking', x*100, 'percent of the data: ' , len(X_pca_test_subset), 'rows')
X_pca_test_subset
X_pca
# len(X_pca) # 798067
# len(X_pca_test_subset)
# 159613/ 798067 # 0.1999994987889488
# 199516/ 798067 # 0.249999060229279
# Over a number of different cluster counts..
# run k-means clustering on the data and...
# compute the average within-cluster distances.
"""https://stackoverflow.com/questions/48607546/k-means-cluster-method-score-negative
https://stackoverflow.com/questions/51138686/how-to-use-silhouette-score-in-k-means-clustering-from-sklearn-library
https://www.datacamp.com/community/tutorials/seaborn-python-tutorial
https://stackoverflow.com/questions/36220829/fine-control-over-the-font-size-in-seaborn-plots-for-academic-papers/36222162
https://stackoverflow.com/questions/16424724/how-can-i-fix-a-memoryerror-when-executing-scikit-learns-silhouette-score
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html
Runtime calculation taken from:
https://github.com/MiguelAMartinez/identify-customer-segments-Arvato/blob/master/Identify_Customer_Segments.ipynb
"""
def choose_k(my_pca_data, k_min, k_max):
    start_time = time.time()
    with active_session():
        list_k = []
        list_score = []
        for k in range(k_min, k_max + 1):
            # run k-means
            kmeans = KMeans(k, random_state=42)
            model = kmeans.fit(my_pca_data)
            # validate with score(): sklearn's KMeans.score returns the negative inertia,
            # so larger (closer to zero) is better. Checked other submissions; some apply
            # abs() to flip the sign, which doesn't change the ranking of k values.
            score = model.score(my_pca_data)
            # print reports
            print("k:", k)
            print("score:", score)
            print("--------------------------------")
            print()
            # append
            list_k.append(k)
            list_score.append(score)
        df_k = pd.DataFrame({'k': list_k, 'score': list_score})
        print(df_k)
        # Investigate the change in within-cluster distance across number of clusters.
        # HINT: Use matplotlib's plot function to visualize this relationship.
        plt.rcParams["figure.figsize"] = (30, 10)
        ax = sns.barplot(x="k", y="score", data=df_k)
        ax.set_title("Number of clusters and score", fontsize=40)
        ax.set_xlabel("Number of clusters", fontsize=30)
        ax.set_ylabel("score", fontsize=30)
        ax.tick_params(labelsize=25)
        # Show the plot
        plt.show()
        print("--- Run time: %s mins ---" % np.round(((time.time() - start_time) / 60), 2))
# this took half an hour to run.
# Taking 25.0 percent of the data: 199516 rows
# Do NOT rerun it if you decide to re-run the whole project :)
choose_k(X_pca_test_subset,k_min=2,k_max=30)
# this took 75-90 mins to run
# My understanding was that k-means run time grows roughly quadratically with the number
# of rows: increase the rows 10x and run time increases ~100x (maybe that referred to
# worst-case computational complexity).
# That's not what I observed here: I used 4x the data (all rows) and run time only
# increased by about 3x.
# Using all the data
# Definitely do NOT rerun it again, if you decide to re-run the whole project :)
choose_k(X_pca,k_min=2,k_max=25)
# Re-fit the k-means model with the selected number of clusters and obtain
# cluster predictions for the general population demographics data.
kmeans_10 = KMeans(n_clusters = 10, random_state=42)
model_10 = kmeans_10.fit(X_pca)
clusters_general = model_10.predict(X_pca)
kmeans_10
model_10
clusters_general
len(clusters_general)
# This number is the same as when I ran choose_k(X_pca, k_min=2, k_max=25) at k = 10
score_10 = model_10.score(X_pca)
score_10
clusters = pd.DataFrame ({'clusters_general' : clusters_general})
clusters
clusters['clusters_general'].value_counts().sort_index()
fig, ax = plt.subplots(1, 1,figsize=(10, 7))
clusters['clusters_general'].value_counts().sort_index().plot(kind='barh').invert_yaxis()
# title and axis labels
plt.title('General Population Clusters')
ax.set_ylabel('Cluster')
ax.set_xlabel('Frequency')
# commas for x axis
fmt = '{x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.xaxis.set_major_formatter(tick)
plt.show()
(Double-click this cell and replace this text with your own text, reporting your findings and decisions regarding clustering. Into how many clusters have you decided to segment the population?)
The number of clusters should be manageable for marketing campaign purposes and easy to explain to stakeholders. More than 10 clusters is therefore not recommended (although it depends on how the clusters would be used): for day-to-day marketing operations, up to 10 is workable, while more than 10 gets too granular and hard to manage. It isn't feasible to run more than 10 distinct marketing approaches.
Somewhat arbitrarily, I'll go with 10 clusters.
I do not see an obvious elbow, but beyond 10 clusters the score no longer improves much, whereas below 10 (and especially below 7) each additional cluster improves the score substantially.
I also ran k-means with 2 to 25 clusters on 25% of the data versus all of the data. The relationship between k and score is approximately the same in both runs, as seen in the graphs. So it appears that for this dataset, you can choose k by running k-means on just a subset.
Cluster 7 is the biggest.
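One way to make the "no obvious elbow, but diminishing returns after a point" argument concrete is to look at the marginal score gain per added cluster. The scores below are invented for illustration (sklearn's `KMeans.score` rises toward zero as k grows):

```python
# hypothetical scores for k = 2..9
ks = list(range(2, 10))
scores = [-900.0, -600.0, -450.0, -380.0, -350.0, -335.0, -325.0, -320.0]

# improvement gained by adding one more cluster at each step
gains = {k: scores[i] - scores[i - 1] for i, k in enumerate(ks) if i > 0}
print(gains)  # the gains shrink quickly: most of the improvement comes early
```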
Now that you have clusters and cluster centers for the general population, it's time to see how the customer data maps on to those clusters. Take care to not confuse this for re-fitting all of the models to the customer data. Instead, you're going to use the fits from the general population to clean, transform, and cluster the customer data. In the last step of the project, you will interpret how the general population fits apply to the customer data.
The customer file is also semicolon (;) delimited. Use the clean_data() function you created earlier. (You can assume that the customer demographics data has similar meaning behind missing data patterns as the general demographics data.) Don't use a .fit() or .fit_transform() method to re-fit the old objects, and don't create new sklearn objects! Carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
# Load in the customer demographics data.
customers = pd.read_csv("Udacity_CUSTOMERS_Subset.csv", sep=';')
customers.shape
customers.info()
customers.head(10)
customers.tail(10)
# clean and encode
customers_cleaned_encoded = clean_data(customers)
customers_cleaned_encoded.shape
customers_cleaned_encoded.head()
customers_cleaned_encoded.isnull().sum()
# number of columns with missing values
# this number was 34 for azdias, as far as I remember
customers_cleaned_encoded.isnull().any().sum()
# The imputer produces an array, not a DataFrame. I'd lose my columns, and I'll need them to rebuild the df.
# So I need to save my column names
columns_list_customers = list(customers_cleaned_encoded.columns)
columns_list_customers
# impute
customers_cleaned_encoded_imputed = imputer.transform(customers_cleaned_encoded)
print(type(customers_cleaned_encoded_imputed))
customers_cleaned_encoded_imputed = pd.DataFrame(customers_cleaned_encoded_imputed,columns=columns_list_customers)
print(type(customers_cleaned_encoded_imputed))
customers_cleaned_encoded_imputed.head()
# we now should have no NANs
customers_cleaned_encoded_imputed.isnull().any().sum()
customers_cleaned_encoded_imputed.isnull().sum()
# checking how mode imputation worked
print(customers_cleaned_encoded['KKK'].isna().sum())
print(customers_cleaned_encoded['KKK'].value_counts())
print(customers_cleaned_encoded_imputed['KKK'].isna().sum())
print(customers_cleaned_encoded_imputed['KKK'].value_counts())
# 'KKK'
# 10137+40739 = 50876
10137+40739
# checking how mode imputation worked
print(customers_cleaned_encoded['ALTERSKATEGORIE_GROB'].isna().sum())
print(customers_cleaned_encoded['ALTERSKATEGORIE_GROB'].value_counts())
print(customers_cleaned_encoded_imputed['ALTERSKATEGORIE_GROB'].isna().sum())
print(customers_cleaned_encoded_imputed['ALTERSKATEGORIE_GROB'].value_counts())
I use the sklearn objects from the general demographics data, and apply their transformations to the customers data. That is, we should not be using a .fit() or .fit_transform() method to re-fit the old objects, nor should we be creating new sklearn objects! We should carry the data through the feature scaling, PCA, and clustering steps, obtaining cluster assignments for all of the data in the customer demographics data.
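The transform-only flow described above can be collapsed into one helper. The stub classes below stand in for the already-fitted imputer, scaler, PCA and k-means objects, so this is a shape sketch of the pipeline rather than the real thing; the function names and signature are illustrative:

```python
import pandas as pd

class _Identity:
    """Stand-in for a fitted transformer (imputer / scaler / PCA)."""
    def transform(self, X):
        return X

class _ZeroCluster:
    """Stand-in for a fitted KMeans model."""
    def predict(self, X):
        return [0] * len(X)

def project_new_data(df, imputer, scaler, pca, model, columns):
    """Apply already-fitted objects only -- no fit/fit_transform on new data."""
    X = pd.DataFrame(imputer.transform(df), columns=columns)
    X = pca.transform(scaler.transform(X))
    return model.predict(X)

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
labels = project_new_data(df, _Identity(), _Identity(), _Identity(), _ZeroCluster(), ['a', 'b'])
print(labels)  # -> [0, 0]
```

In the notebook itself, the fitted `imputer`, `scaler`, `pca` and `model_10` objects would be passed in place of the stubs.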
233 + 47377
azdias_cleaned_encoded.ALTERSKATEGORIE_GROB.value_counts()
customers_cleaned_encoded.ALTERSKATEGORIE_GROB.value_counts()
# scale
customers_scaled = scaler.transform(customers_cleaned_encoded_imputed)
print(type(customers_scaled))
customers_scaled = pd.DataFrame(customers_scaled, columns=columns_list_customers)
print(type(customers_scaled))
customers_scaled.head()
customers_cleaned_encoded_imputed.describe().transpose()
# notice how scaling didn't do its job here
customers_scaled.describe().transpose()
I know we were told NOT to use fit_transform.
However, the idea behind StandardScaler is to normalize variables so they have a mean of 0 and a standard deviation of 1.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
https://stackoverflow.com/questions/40758562/can-anyone-explain-me-standardscaler
If I only use transform without fitting, the customer data does not end up properly normalized to mean 0 and standard deviation 1.
I will diverge from the instructions here and refit the existing scaler to the customer data so it is normalized properly; otherwise I'm worried k-means won't work correctly.
I understand that the idea is to build a pipeline and run the customer data through it, but I'm concerned that scaling doesn't appear to have worked properly for the customer data in this case.
I will still run k-means after both "transform" and "fit_transform" and see whether the clusters are affected.
The customer data did get closer to mean 0 and standard deviation 1 after scaling it with the general population scaler, though.
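A toy illustration of why transform-only scaling need not produce mean 0 / std 1 on the customer data: if customers are genuinely distributed differently from the general population, the offset that remains after transform reflects that difference rather than a scaling failure. All numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
general = rng.normal(0.0, 1.0, size=100_000)    # stand-in for the population data
customers = rng.normal(1.5, 0.5, size=100_000)  # customers: shifted and tighter

mu, sigma = general.mean(), general.std()       # "fit" on the general population only
customers_t = (customers - mu) / sigma          # "transform" the customers

# the mean is well away from 0: customers sit above the population average,
# expressed in population standard deviations
print(round(customers_t.mean(), 2))
print(round(customers_t.std(), 2))
```

This is the standard argument for keeping the population-fitted scaler: the shift is exactly the signal k-means needs to place customers into the population's clusters.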
# this code is just like the one a few cells above,
# but I'm re-fitting the scaler to the customer data to normalize it to mean 0 and stdev 1
# scale
"""
customers_scaled was obtained thru transform (using general population scaler)
customers_scaled_1 was obtained thru fit_transform (using customer scaler)
"""
customers_scaled_1 = scaler.fit_transform(customers_cleaned_encoded_imputed)
print(type(customers_scaled_1))
customers_scaled_1 = pd.DataFrame(customers_scaled_1, columns=columns_list_customers)
print(type(customers_scaled_1))
customers_scaled_1.head()
# now it's scaled nicely to mean of 0 and stdev of 1
customers_scaled_1.describe().transpose()
pca
def cluster_customer_data(my_customer_scaled_data, my_graph_title):
    """Apply the fitted PCA and k-means model to (already scaled) customer data."""
    # PCA
    customers_pca = pca.transform(my_customer_scaled_data)
    print(type(customers_pca))
    print()
    print('customers_pca')
    print(customers_pca)
    print()
    # predict with k-means
    preds_customers = model_10.predict(customers_pca)
    print()
    print('preds_customers')
    print(preds_customers)
    print()
    # make a df with clusters
    clusters_customers = pd.DataFrame({'clusters_customers': preds_customers})
    print('clusters_customers')
    print(clusters_customers.head(30))
    print()
    print('clusters_customers summary')
    print(clusters_customers['clusters_customers'].value_counts().sort_index())
    # visualize
    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    clusters_customers['clusters_customers'].value_counts().sort_index().plot(kind='barh').invert_yaxis()
    # title and axis labels
    plt.title(my_graph_title)
    ax.set_ylabel('Cluster')
    ax.set_xlabel('Frequency')
    # commas for x axis
    fmt = '{x:,.0f}'
    tick = mtick.StrMethodFormatter(fmt)
    ax.xaxis.set_major_formatter(tick)
    plt.show()
    return clusters_customers
clusters_customers = cluster_customer_data(customers_scaled,
'CUSTOMER CLUSTERS - SCALED WITH GENERAL POPULATION SCALER WITH TRANSFORM')
clusters_customers = cluster_customer_data(customers_scaled_1,
'CUSTOMER CLUSTERS - SCALED WITH CUSTOMER SCALER WITH FIT_TRANSFORM')
At this point, you have clustered data based on demographics of the general population of Germany, and seen how the customer data for a mail-order sales company maps onto those demographic clusters. In this final substep, you will compare the two cluster distributions to see where the strongest customer base for the company is.
Consider the proportion of persons in each cluster for the general population, and the proportions for the customers. If the company's customer base were universal, then the cluster assignment proportions should be fairly similar between the two. If only particular segments of the population are interested in the company's products, then we should see a mismatch from one to the other. If there is a higher proportion of persons in a cluster for the customer data compared to the general population (e.g. 5% of persons are assigned to a cluster for the general population, but 15% of the customer data is closest to that cluster's centroid), that suggests the people in that cluster are a target audience for the company. On the other hand, if the proportion of the data in a cluster is larger in the general population than in the customer data (e.g. only 2% of customers are closest to a population centroid that captures 6% of the data), that suggests that group of persons is outside of the target demographics.
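The comparison just described boils down to a per-cluster difference of proportions; a sketch on invented numbers:

```python
import pandas as pd

# hypothetical cluster proportions for illustration
comparison = pd.DataFrame({
    'cluster': [0, 1, 2],
    'gen_perc': [0.50, 0.30, 0.20],   # share of the general population per cluster
    'cust_perc': [0.30, 0.25, 0.45],  # share of the customers per cluster
})
# positive diff: overrepresented among customers -> likely target audience
comparison['diff'] = comparison['cust_perc'] - comparison['gen_perc']

target = comparison.loc[comparison['diff'].idxmax(), 'cluster']
print(target)  # -> 2
```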
Take a look at the following points in this step:
countplot() or barplot() function could be handy. The .inverse_transform() method of the PCA and StandardScaler objects can be used to transform centroids back to the original data space and interpret the retrieved values directly.
# This is just testing. Again, as described above, "transform" doesn't normalize the customer data fully; "fit_transform" works fine.
# keep this one commented out, don't use for analysis
# clusters_customers = cluster_customer_data(customers_scaled,
# 'CUSTOMER CLUSTERS - SCALED WITH GENERAL POPULATION SCALER WITH TRANSFORM')
# use this one for analysis
# clusters_customers = cluster_customer_data(customers_scaled_1,
# 'CUSTOMER CLUSTERS - SCALED WITH CUSTOMER SCALER WITH FIT_TRANSFORM')
# too many missing values in general population
print(len(azdias))
print(len(azdias_cleaned_encoded))
azdias_too_many_NA_in_rows = len(azdias) - len(azdias_cleaned_encoded)
print(azdias_too_many_NA_in_rows)
# too many missing values in customers
print(len(customers))
print(len(customers_cleaned_encoded))
customers_too_many_NA_in_rows = len(customers) - len(customers_cleaned_encoded)
print(customers_too_many_NA_in_rows)
# general pop clusters
summary_general = clusters['clusters_general'].value_counts().sort_index()
summary_general
# customer clusters
summary_customers = clusters_customers.clusters_customers.value_counts().sort_index()
summary_customers
# create a dataframe with general population and customer clusters
# https://stackoverflow.com/questions/18062135/combining-two-series-into-a-dataframe-in-pandas
summary_comparison = pd.concat([summary_general, summary_customers], axis =1)
print(type(summary_comparison))
summary_comparison.reset_index(inplace=True)
summary_comparison.columns = ['cluster','general','customers']
summary_comparison
# visualize gen pop and cust clusters
fig, ax = plt.subplots(1, 2,figsize=(15, 7))
plt.subplot(1,2,1)
plt.bar(summary_comparison['cluster'], summary_comparison['general'], align='center', alpha=0.5)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
title = 'General Population Clusters'
plt.title(title, fontsize=18)
plt.subplot(1,2,2)
plt.bar(summary_comparison['cluster'], summary_comparison['customers'], align='center', alpha=0.5)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
title = 'Customer Clusters'
plt.title(title, fontsize=18)
plt.show()
# add a pseudo-cluster for the rows that were dropped for having too many missing values
# https://thispointer.com/python-pandas-how-to-add-rows-in-a-dataframe-using-dataframe-append-loc-iloc/
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
summary_comparison_with_NAN_cluster = summary_comparison.append({'cluster' : -1,
'general': azdias_too_many_NA_in_rows,
'customers':customers_too_many_NA_in_rows},
ignore_index=True).set_index('cluster', drop=False)
summary_comparison_with_NAN_cluster
# visualize gen pop and cust clusters again. This time with NAN / null cluster
fig, ax = plt.subplots(1, 2,figsize=(15, 7))
plt.subplot(1,2,1)
plt.bar(summary_comparison_with_NAN_cluster['cluster'], summary_comparison_with_NAN_cluster['general'], align='center', alpha=0.5)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
title = 'General Population Clusters. -1 is lots of NAN.'
plt.title(title, fontsize=18)
plt.subplot(1,2,2)
plt.bar(summary_comparison_with_NAN_cluster['cluster'], summary_comparison_with_NAN_cluster['customers'], align='center', alpha=0.5)
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('People', fontsize=14)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
title = 'Customer Clusters. -1 is lots of NAN.'
plt.title(title, fontsize=18)
plt.show()
# all clusters for gen pop and cust, incuding NAN
summary_comparison_with_NAN_cluster
# https://stackoverflow.com/questions/43217916/pandas-data-precision
# show only four decimal places
pd.set_option('display.float_format', '{:0.4f}'.format)
# gen_perc & cust_perc show proportion of clusters
# diff is difference
summary_comparison_with_NAN_cluster['gen_perc'] = summary_comparison_with_NAN_cluster['general'] / len(azdias)
summary_comparison_with_NAN_cluster['cust_perc'] = summary_comparison_with_NAN_cluster['customers'] / len(customers)
summary_comparison_with_NAN_cluster['diff'] =summary_comparison_with_NAN_cluster['cust_perc']- summary_comparison_with_NAN_cluster['gen_perc']
summary_comparison_with_NAN_cluster
print("Clusters from most likely ones to be target audience to least likely ones")
summary_comparison_with_NAN_cluster.sort_values(by='diff',ascending= False)
# 4 is the most likely to be target audience
# 3 is outside of the target audience
# https://stackoverflow.com/questions/42128467/matplotlib-plot-multiple-columns-of-pandas-data-frame-on-the-bar-chart
# https://stackoverflow.com/questions/51174691/how-to-increase-image-size-of-pandas-dataframe-plot-in-jupyter-notebook/51174822
summary_comparison_with_NAN_cluster.plot(x='cluster', y=['gen_perc', 'cust_perc'],
kind="bar",
figsize=(10,5),
alpha=0.5,
color=['#00aad2', '#ff7900'])
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Proportion', fontsize=14)
plt.xticks(fontsize=15, rotation=0)
plt.yticks(fontsize=15)
title = 'Clusters Proportion: Gen Pop vs Customers'
plt.title(title, fontsize=18)
plt.show()
print('Overrepresented in customers')
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff'] > 0].sort_values(by='diff', ascending=False)
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff'] > 0].sort_values(
    by='diff', ascending=False).plot(
        x='cluster', y=['gen_perc', 'cust_perc'],
        kind="bar",
        figsize=(10, 5),
        alpha=0.5,
        color=['#00aad2', '#ff7900'])
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Proportion', fontsize=14)
plt.xticks(fontsize=15, rotation=0)
plt.yticks(fontsize=15)
title = 'Overrepresented in Customers'
plt.title(title, fontsize=18)
plt.show()
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff'] < 0].sort_values(
    by='diff', ascending=True).plot(
        x='cluster', y=['gen_perc', 'cust_perc'],
        kind="bar",
        figsize=(10, 5),
        alpha=0.5,
        color=['#00aad2', '#ff7900'])
plt.xlabel('Clusters', fontsize=14)
plt.ylabel('Proportion', fontsize=14)
plt.xticks(fontsize=15, rotation=0)
plt.yticks(fontsize=15)
title = 'Underrepresented in Customers'
plt.title(title, fontsize=18)
plt.show()
print('Underrepresented in customers')
summary_comparison_with_NAN_cluster[summary_comparison_with_NAN_cluster['diff'] < 0].sort_values(by='diff', ascending=True)
pca.inverse_transform(model_10.cluster_centers_)
scaler.inverse_transform(pca.inverse_transform(model_10.cluster_centers_))
# What kinds of people are part of a cluster that is overrepresented and underrepresented in the
# customer data compared to the general population?
# adapted from:
# https://github.com/chauhan-nitin/Udacity-IdentifyCustomerSegments-Arvato/blob/master/Identify_Customer_Segments.ipynb
# Cluster center specs can be found below:
cluster_centroids = pd.DataFrame(scaler.inverse_transform(pca.inverse_transform(model_10.cluster_centers_)), columns=columns_list_customers)
cluster_centroids
# look at cluster centroids for cluster 4 (overrepresented) and cluster 3 (underrepresented)
# I exported the centroids to Excel and compared them against the data dictionary
# cluster_centroids.to_excel('customer_cluster_centroids.xlsx')
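As an alternative to the Excel round-trip, the same comparison can be done directly in pandas by ranking features by the gap between the two centroids. A minimal sketch with an illustrative stand-in for `cluster_centroids` (the centroid values below are made up; only the column names come from the dataset):

```python
import pandas as pd

# Illustrative stand-in for `cluster_centroids`: rows are clusters,
# columns are features on the original (inverse-transformed) scale.
cluster_centroids = pd.DataFrame(
    {'ALTERSKATEGORIE_GROB': [2.1, 3.8],
     'FINANZ_MINIMALIST': [4.5, 1.2],
     'HH_EINKOMMEN_SCORE': [1.9, 5.1]},
    index=[4, 3])  # 4 = overrepresented cluster, 3 = underrepresented cluster

# Rank features by how far apart the two centroids sit
gap = (cluster_centroids.loc[4] - cluster_centroids.loc[3]).abs().sort_values(ascending=False)
print(gap)
```

The largest gaps point to the features that best distinguish the target segment from the rest, which is exactly what the data-dictionary comparison is looking for.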
I regret dropping some variables from the analysis, such as LP_STATUS_GROB (social status), SHOPPER_TYP (shopper type), and other mixed-type variables describing social status and shopping habits. These are multi-level categorical variables, and I should have recoded them as dummy variables instead of dropping them; for marketing purposes this information is critical.
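A sketch of that dummy-encoding alternative, using `pd.get_dummies` on an illustrative `SHOPPER_TYP` column (the values here are made up):

```python
import pandas as pd

# Illustrative frame with one multi-level categorical variable
df = pd.DataFrame({'SHOPPER_TYP': [0, 1, 2, 3, 1]})

# Recode the variable as one dummy column per level, then drop the original
dummies = pd.get_dummies(df['SHOPPER_TYP'], prefix='SHOPPER_TYP')
df = pd.concat([df.drop(columns='SHOPPER_TYP'), dummies], axis=1)
print(df.columns.tolist())
```

This keeps the information in a form the scaler, PCA, and k-means can all consume.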
Even 10 clusters may be overkill: they are hard to analyze and describe. Around 7 might have been more practical for marketing.
The way I created calculated variables was inefficient: I created temporary variables and then dropped them. It would be worth reviewing how others approached this step.
I didn't need to run k-means with 2, 3, 4, ..., 30 clusters on the full data; a 25% random subset is sufficient for choosing the number of clusters.
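A sketch of that subsampling idea, with synthetic data standing in for the PCA-transformed matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X = rng.rand(2000, 5)  # stand-in for the PCA-transformed data

# Run the k sweep on a 25% random subset -- enough to locate the elbow
sample = X[rng.choice(len(X), size=len(X) // 4, replace=False)]
inertias = {k: KMeans(n_clusters=k, random_state=42, n_init=10).fit(sample).inertia_
            for k in range(2, 11)}
```

The chosen k can then be refit once on the full data, cutting the sweep's runtime to roughly a quarter.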
Some cluster descriptions are repetitive or differ only slightly (e.g., middle class vs. upper middle class) because several variables describe the same underlying trait. That redundancy is useful for checking that a cluster is internally consistent and that the data contains no major contradictions.
We could also describe clusters 6 and 7, as well as clusters 2 and 9.
Congratulations on making it this far in the project! Before you finish, make sure to check through the entire notebook from top to bottom to make sure that your analysis follows a logical flow and all of your findings are documented in Discussion cells. Once you've checked over all of your work, you should export the notebook as an HTML document to submit for evaluation. You can do this from the menu, navigating to File -> Download as -> HTML (.html). You will submit both that document and this notebook for your project submission.